Neural networks will be the common language of computer vision in the future, according to Professor Jitendra Malik of UC Berkeley. Greg Blackman listened to his keynote at the Embedded Vision Summit in Santa Clara
Neural networks will be the primary language of computer vision in the future, rather as English is the common language of the scientific community. At least, that is the hope of Professor Jitendra Malik of the University of California, Berkeley. Malik was speaking at the Embedded Vision Summit, a computer vision conference organised by the Embedded Vision Alliance and held in Santa Clara, California, from 1 to 3 May.
Half of the technical insight presentations at the conference focused on deep learning and neural networks, a branch of artificial intelligence in which algorithms learn to recognise objects in a scene from large labelled datasets, rather than being written by hand for a specific task.
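To make that distinction concrete, the sketch below shows what 'trained on large datasets' means in practice: a tiny convolutional classifier written in PyTorch that learns ten made-up object classes from labelled examples. It is purely illustrative and not code shown at the summit; the random tensors stand in for a real dataset.

```python
import torch
import torch.nn as nn

# The random tensors below stand in for a real labelled image dataset.
images = torch.randn(64, 3, 32, 32)              # 64 RGB images, 32x32 pixels
labels = torch.randint(0, 10, (64,))             # ten made-up object classes

# A small convolutional classifier; nothing task-specific is written by hand.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # scores for the ten classes
)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                          # the 'training' in deep learning
    optimiser.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()                              # the weights are adjusted from
    optimiser.step()                             # the data, not hand-coded rules
```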
Introducing Malik's keynote address, Jeff Bier, founder of the Embedded Vision Alliance, said that 70 per cent of vision developers surveyed by the Alliance were using neural networks, a huge shift from the 2014 summit only three years earlier, when hardly anyone was using them.
Deep learning has also begun to reach the industrial machine vision world: the latest version of MVTec's Halcon software includes an OCR tool based on deep learning, and Vidi Systems, now owned by Cognex, offers a deep learning software suite for machine vision.
Malik went further than saying neural networks merely have their place in computer vision, suggesting that deep learning could be used to unite its different strands. He gave the example of 3D vision, where algorithms such as simultaneous localisation and mapping (SLAM) have traditionally been used to model the world in 3D, and where machine learning has not been thought suitable. The world of geometry, into which techniques like SLAM fall, and the world of machine learning need to be brought together, he said.
A human views a chair in 3D, for instance, while also drawing on past experience of other chairs he or she has seen. In computer vision terms, geometry and machine learning are two very different languages, and Malik argued that, just as scientists settle on English to communicate, computer vision should settle on a common language. ‘In my opinion, it is easier to make everybody learn English, which in this case is neural networks,’ he said.
There are neural networks that begin to combine the two ways of thinking, but Malik noted that putting geometrical data into the language of neural networks will require a fundamental breakthrough. He added that he believes this marriage of geometrical thinking and machine learning-based methods will be achieved within the next couple of years.
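As a purely illustrative sketch of one way geometric data can already be fed to a neural network, the toy PyTorch model below treats a depth map, of the kind a SLAM or stereo pipeline might estimate, as an extra input channel alongside the RGB image. This simple fusion scheme is an assumption made for the example, not a method Malik described.

```python
import torch
import torch.nn as nn

class RGBDNet(nn.Module):
    """Toy network fusing appearance (RGB) with geometry (a depth map)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),   # 3 colour + 1 depth channel
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)               # stack image and geometry
        return self.classifier(self.features(x).flatten(1))

net = RGBDNet()
rgb = torch.randn(1, 3, 64, 64)                          # placeholder image
depth = torch.randn(1, 1, 64, 64)                        # placeholder depth map
scores = net(rgb, depth)                                 # class scores, shape (1, 10)
```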
Malik noted that another exciting area of computer vision research is training machines to make predictions, particularly predictions about people and social behaviour. This involves teaching machines to recognise actions and to make sense of people's behaviour in light of their possible objectives.
He also suggested that computer vision scientists should take note of the research carried out in neuroscience, since deep neural networks are originally based on findings in neuroscience. ‘Neuroscientists found phenomena in the brain which led us down this path,’ he said, adding that researchers should keep looking in the neuroscience literature to see if there are things that should be exploited.
One other problem Malik felt computer vision needed to address was limited data. Neural networks learn about the world from masses of data, but there will always be instances where there isn't enough information. He gave the example of work at the Berkeley Artificial Intelligence Research Laboratory in which a robot taught itself to manipulate objects by poking them repeatedly. The robot is not trained explicitly; it teaches itself. The work uses two different models, a forward one and an inverse one, and the interplay between them gives the robot an accurate means of decision making.
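The forward/inverse pairing can be sketched in a few lines. The toy PyTorch code below loosely follows that structure: a forward model predicts the scene features after a poke from the current features and the chosen action, while an inverse model predicts which action produced an observed change, and the two are trained jointly on the robot's own experience. The feature sizes, action dimension and training details here are invented for illustration, not taken from the Berkeley work.

```python
import torch
import torch.nn as nn

FEAT, ACT = 128, 4                               # state feature size, action size

forward_model = nn.Sequential(                   # predicts next state features
    nn.Linear(FEAT + ACT, 256), nn.ReLU(),       # from current features + action
    nn.Linear(256, FEAT),
)
inverse_model = nn.Sequential(                   # predicts which action caused
    nn.Linear(FEAT * 2, 256), nn.ReLU(),         # the change between two states
    nn.Linear(256, ACT),
)

# Placeholder experience gathered by the robot poking objects on its own:
s_t = torch.randn(32, FEAT)                      # scene features before the poke
s_next = torch.randn(32, FEAT)                   # scene features after the poke
action = torch.randn(32, ACT)                    # the poke that was executed

params = list(forward_model.parameters()) + list(inverse_model.parameters())
optimiser = torch.optim.Adam(params, lr=1e-3)

pred_next = forward_model(torch.cat([s_t, action], dim=1))
pred_action = inverse_model(torch.cat([s_t, s_next], dim=1))

# Train both models jointly; each regularises the other. At decision time the
# robot can pick the poke whose predicted outcome best matches its goal.
loss = nn.functional.mse_loss(pred_next, s_next) + \
       nn.functional.mse_loss(pred_action, action)
optimiser.zero_grad()
loss.backward()
optimiser.step()
```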