Embedded vision looks set to disrupt the surveillance market, according to Michael Tusch of ARM. Greg Blackman reports on his presentation at the Embedded Vision Summit
By 2030, most surveillance cameras will not produce video. This prediction was made by Michael Tusch, founder of UK embedded imaging startup Apical, now owned by ARM, at the Embedded Vision Summit, an annual computer vision conference run by the Embedded Vision Alliance and held from 1 to 3 May in Santa Clara, California.
So what will IP cameras produce if not video? The answer Tusch gave is metadata, largely because the volume of information IP cameras generate is becoming increasingly difficult to transmit and store.
Tusch noted that surveillance is one traditional market that is ‘ripe for disruption’ by embedded vision, a term that is difficult to define but typically involves image processing onboard an embedded computing platform like an ARM processor.
To illustrate the scale of the problem, Tusch first noted that IP cameras are cheap and that 120 million were sold in 2016, all of which can stream HD video at 30 frames per second. He said that, assuming a single HD IP camera streams at 10Mb/s, if all 120 million new IP cameras were connected to the internet, that would equate to 1.2 petabits per second of web traffic. At a monthly rate, that’s 400 exabytes per month, which is four times what Cisco projects global IP traffic to be this year.
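As a back-of-envelope check on those figures, a short calculation in Python (assuming a 30-day month and the 10Mb/s per camera quoted in the talk) reproduces both numbers:

```python
# Rough reproduction of the bandwidth figures quoted in the talk.
cameras = 120e6        # new IP cameras sold in 2016
bitrate_bps = 10e6     # assumed 10 Mb/s per HD camera

aggregate_bps = cameras * bitrate_bps
print(f"Aggregate traffic: {aggregate_bps / 1e15:.1f} petabits per second")

seconds_per_month = 30 * 24 * 3600
monthly_bytes = aggregate_bps / 8 * seconds_per_month
print(f"Monthly volume: roughly {monthly_bytes / 1e18:.0f} exabytes")
```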
Of course, not all of those 120 million new cameras are connected to the internet at the same time, and some are replacing old cameras, but this still shows that IP cameras contribute a lot of network traffic, Tusch said.
Data transmission is one problem; storing it is another. Around 500 hours of video are uploaded to YouTube every minute, a figure Tusch quoted in his presentation; around 3,000 YouTubes would be needed to store the video created by 120 million new IP cameras.
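Under the crude assumption that every camera records continuously, the same kind of sketch gives the order of magnitude of that comparison (the exact multiple depends on assumptions Tusch did not spell out):

```python
# How many 'YouTubes' of ingest 120 million always-on cameras would represent.
cameras = 120e6
camera_hours_per_minute = cameras / 60       # each camera produces 1 minute of video per minute
youtube_upload_hours_per_minute = 500        # figure quoted in the presentation

multiple = camera_hours_per_minute / youtube_upload_hours_per_minute
print(f"Roughly {multiple:,.0f} YouTubes' worth of video ingest")
```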
‘If you look today at the largest hosters of video, they are not Facebook and they are not YouTube, they are mid-sized security video storage companies,’ Tusch said.
So, who will watch all this video? The answer is machines and artificial intelligence. ‘That gives us an opportunity to deal with the problems of transmission and storage,’ Tusch added.
Deep learning algorithms – image processing algorithms trained on large amounts of data to find patterns – are now able to categorise scenes in video, as shown by Google and others that use the technology to make web searches more effective.
A text label for a video of a fainting goat is nowhere near as funny as watching the clip itself, but describing a still from a security camera feed in a few bytes of data is really all that is needed for surveillance – watching the video is not necessary at all. ‘There’s really no useful information a pixel can convey, they’re completely redundant,’ Tusch said.
‘This gives us an opportunity,’ he continued. ‘Maybe we don’t need 3,000 YouTubes a year to store security footage; if we can only categorise scenes successfully at the edge [i.e. at the device rather than on a server or in the cloud], and get rid of all the pixels at that point, we don’t even have to transmit the video.’
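To make the ‘pixels to metadata’ idea concrete, the sketch below reduces each frame to a metadata record of a few dozen bytes. The pretrained MobileNetV2 classifier and its ImageNet labels are stand-ins for whatever network an edge vision engine would actually run, not a description of any specific product:

```python
# A minimal sketch: classify a frame on the device and emit JSON metadata
# rather than streaming the pixels themselves.
import json
import time

import torch
from PIL import Image
from torchvision import models

weights = models.MobileNet_V2_Weights.DEFAULT
model = models.mobilenet_v2(weights=weights).eval()
preprocess = weights.transforms()
labels = weights.meta["categories"]

def frame_to_metadata(frame: Image.Image) -> str:
    """Reduce one video frame to a small metadata record."""
    with torch.no_grad():
        scores = model(preprocess(frame).unsqueeze(0)).softmax(dim=1)
    confidence, index = scores.max(dim=1)
    record = {
        "timestamp": time.time(),
        "label": labels[index.item()],
        "confidence": round(confidence.item(), 3),
    }
    return json.dumps(record)    # tens of bytes, versus megabits of video

# Example: frame_to_metadata(Image.open("frame.jpg"))
```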
However, making sense of security footage is not just a case of classifying the scene, but also of identifying the behaviour of objects – normally people – within those frames. There is the potential to do face recognition, for instance, but this kind of processing is extremely expensive if done on a server, Tusch noted.
To find people at different scales and different locations would require the equivalent of at least 300 full classifications per frame. Tusch calculated – again just to illustrate the problem – that with 120 million new IP cameras this would cost $132 million per hour running on Amazon Web Services.
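Tusch did not give the assumptions behind the $132 million figure, but a simple cost model shows how quickly server-side inference scales; the instance price and throughput below are placeholder assumptions chosen only to land in the same ballpark:

```python
# Illustrative cost model for server-side classification (placeholder values).
cameras = 120e6
fps = 30
classifications_per_frame = 300              # scales and locations, per the talk

inferences_per_second = cameras * fps * classifications_per_frame

instance_price_per_hour = 3.0                # assumed cloud GPU instance, dollars
inferences_per_instance_per_second = 25_000  # assumed sustained throughput

instances_needed = inferences_per_second / inferences_per_instance_per_second
print(f"Instances required: {instances_needed:,.0f}")
print(f"Cost per hour: ${instances_needed * instance_price_per_hour:,.0f}")
```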
Tusch acknowledged the argument that such systems don’t need to process every full frame and could instead rely on motion detection to trigger uploads when needed, but he said that, in his experience, full-frame, full-resolution processing is necessary to achieve the required accuracy.
‘You can see that the problem of doing computer vision on servers is a serious one, even if you can get the video into the cloud and store it,’ he commented. ‘These problems of transmission, storage and cloud compute are so great that when people talk about the need to optimise or balance computer vision between servers and the edge, what they really mean is that all of the computer vision has to happen on the device, at the edge, and that the conversion from pixels to objects has to happen there, otherwise you run into scalability issues at the very first stage.’
Can recognition and localisation be computed on the device today? Tusch said that traditional embedded computing architectures are not fast enough for the kind of complex scene analysis needed for security applications. Running the neural networks that underpin deep learning means moving many gigabits per second of pixel data to and from memory, which limits computational performance. ‘I would argue that traditional architectures are not remotely good enough and many GPUs and DSPs, certainly not CPUs, can’t get anywhere near the kind of performance needed,’ he said.
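A rough calculation illustrates the memory-traffic point: even the raw pixels of a single 1080p30 stream amount to around 1.5Gb/s, before counting the intermediate feature maps a network reads and writes at every layer:

```python
# Raw pixel traffic for one HD camera stream (illustrative figures).
width, height, channels, bytes_per_channel, fps = 1920, 1080, 3, 1, 30
raw_bytes_per_second = width * height * channels * bytes_per_channel * fps
print(f"Raw 1080p30 input: {raw_bytes_per_second * 8 / 1e9:.1f} Gb/s")
# Every convolutional layer reads its input and writes an activation map of
# comparable size, so total DRAM traffic is a large multiple of this figure.
```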
There are other architectures: Apical developed a low-power engine called Spirit, now owned by ARM, that could potentially be used for this. ‘I’m not familiar with anything today that can do this, but I’m sure that within a year or two this problem will be solved,’ Tusch remarked.
Tusch said that doing complex analysis at the edge, on the device, would reduce pixels to metadata, which could then be streamed to the cloud and analysed there. Take iris recognition, which requires around 100 pixels and can be done at a distance of 1.7 metres with a 25 megapixel sensor. ‘You can’t even think about encoding video at that resolution at the edge let alone transmitting it and storing it,’ Tusch said, ‘but if we have an embedded vision engine running close to the sensor that can process every pixel, we have a very viable approach. We can analyse every frame, we can find faces, we can crop irises, and then we could process locally or send the cropped encoded JPEG up to a server to do the recognition. By taking video out of the problem, and replacing it with either pure metadata or at the very least regions of interest, you get something that’s practical and cheap.’
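A minimal sketch of that kind of edge pipeline might look like the following; the Haar-cascade face detector and the send_to_server callback are illustrative stand-ins, not what Apical’s Spirit engine actually runs:

```python
# Find faces on the device, crop the regions of interest, and upload a few
# kilobytes per detection instead of a continuous video stream.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def process_frame(frame, send_to_server):
    """Replace a full frame with cropped regions of interest plus metadata."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1,
                                                  minNeighbors=5):
        ok, jpeg = cv2.imencode(".jpg", frame[y:y + h, x:x + w])
        if ok:
            send_to_server({"bbox": [int(x), int(y), int(w), int(h)],
                            "jpeg": jpeg.tobytes()})
```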
‘I think that by 2022 all new IP cameras in the world will have some kind of embedded convolutional neural network or equivalent inside them,’ he continued, adding that at some point in the future – he gave 2030 – the vast majority of cameras connected to the internet will not produce any video whatsoever.
For that to happen, processing at the edge will have to be solved, along with other challenges such as real-time scene analysis. Replacing pixels completely with object detection at the edge requires very high accuracy, which doesn’t exist yet, Tusch said.
He also added that, for surveillance applications, all kinds of events have to be detected, and that without deep learning it would be hard to throw away information at the source. For object recognition and object detection, deep learning has strong advantages over traditional approaches.