Greg Blackman reports on the complexities of training AllGo Systems' driver monitoring neural networks, which the firm's VP of engineering, Nirmal Kumar Sancheti, spoke about at the Embedded World trade fair
How can a machine be taught to spot if a driver is distracted or drowsy? A number of car manufacturers have already installed driver monitoring systems in their vehicles – Toyota being one of the first, in 2006 – but doing this reliably involves a lot of inherent complexity.
Nirmal Kumar Sancheti, vice president of engineering for AllGo Systems, presented the firm’s driver monitoring system, called See ‘n Sense, at the Embedded World trade fair in Nuremberg at the end of February.
See ‘n Sense was first demonstrated at CES 2018, and makes use of trained neural networks to identify behaviour that suggests the driver is distracted or sleepy. It was first shown running on a GPU, but has since been ported to and optimised for an Arm platform to keep costs down, Sancheti said at Embedded World.
The system considers a number of parameters, including head pose estimation, gaze detection, and eye state analysis – blink rate and blink duration – to reach a conclusion about how attentive or otherwise the driver is. It has to do this in varying light conditions, and recognise features for different ages, genders, ethnicities, and expressions. Making a common framework for all these situations is ‘really difficult’, Sancheti said. The system also has to deal with occlusions, such as caps or scarves covering faces.
AllGo Systems uses neural networks for its classification building blocks, as deep learning can ‘generalise way better’ than classical feature extraction, Sancheti said. Conventional approaches to image processing struggle with fine-tuning parameters and with generalising – to work in all light conditions, for instance. ‘There are a lot of issues with conventional approaches that we try to solve using deep learning,’ Sancheti added.
While deep learning is preferable to using conventional methods for this type of image processing problem, it still has its drawbacks, namely the huge amount of data required for accurate results, and the computational power needed. ‘No one is ready to put a GPU in a car for a driver monitoring system,’ Sancheti remarked.
See ‘n Sense captures images with an infrared camera and infrared illuminator to avoid interference from visible light. It starts by detecting the person’s face. It then estimates head pose – the way the head is oriented in three angles: yaw, or how far the head is turned from side to side, along with pitch and roll. Another network is run on a cropped image of the eyes to pinpoint where the person is looking, giving pitch and yaw angles for gaze. Head position alone is not enough to say whether the person is distracted, which is why it is combined with gaze direction.
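In outline, that cascade might look something like the Python sketch below. The model callables, the eye-crop proportions and the 30-degree threshold are all illustrative assumptions, not AllGo Systems’ actual values.

```python
# Sketch of the described cascade: detect face -> estimate head pose ->
# run a second network on cropped eye patches for gaze. The nets are
# passed in as callables; everything here is a stand-in, not AllGo's code.
from dataclasses import dataclass

@dataclass
class HeadPose:
    yaw: float    # side-to-side turn, degrees
    pitch: float  # nod up/down, degrees
    roll: float   # in-plane tilt, degrees

def crop(frame, box):
    x0, y0, x1, y1 = box
    return frame[y0:y1, x0:x1]

def crop_eyes(face):
    # Placeholder eye crop: the upper-middle band of the face patch.
    h, w = face.shape[:2]
    return face[int(0.2 * h):int(0.5 * h), int(0.15 * w):int(0.85 * w)]

def analyse_frame(frame, face_net, pose_net, gaze_net):
    face_box = face_net(frame)            # hypothetical face detector
    if face_box is None:
        return None                       # no driver visible this frame
    face = crop(frame, face_box)
    pose = HeadPose(*pose_net(face))      # yaw, pitch, roll from face crop
    gaze_yaw, gaze_pitch = gaze_net(crop_eyes(face))
    # Head pose alone isn't conclusive: a head turned 20 degrees with eyes
    # a further 15 degrees in the same direction is well off the road.
    distracted = abs(pose.yaw + gaze_yaw) > 30.0
    return pose, (gaze_yaw, gaze_pitch), distracted
```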
The eyes are then analysed to classify whether each eye is open or closed, and for how long it has been in that state. Blink rate and duration are important parameters for detecting drowsiness. The system uses a neural network to make the classification, but because it concentrates only on the eye region, which is small, AllGo Systems found that a small network was sufficient. And because the system crops out the eyes for gaze estimation and eye state analysis, it can deal with a lot of occlusions.
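Turning the per-frame open/closed decision into blink rate and blink duration is then a matter of bookkeeping over time. A minimal sketch, assuming a fixed camera frame rate:

```python
class BlinkTracker:
    """Accumulates per-frame eye state into blink rate and duration."""

    def __init__(self, fps: float):
        self.fps = fps
        self.frames_seen = 0
        self.closed_run = 0        # consecutive frames with eyes closed
        self.blink_durations = []  # seconds, one entry per completed blink

    def update(self, eyes_open: bool):
        self.frames_seen += 1
        if not eyes_open:
            self.closed_run += 1
        elif self.closed_run > 0:
            # The eye has just reopened, so a blink is complete.
            self.blink_durations.append(self.closed_run / self.fps)
            self.closed_run = 0

    @property
    def blinks_per_minute(self) -> float:
        minutes = self.frames_seen / (self.fps * 60.0)
        return len(self.blink_durations) / minutes if minutes else 0.0

    @property
    def mean_blink_duration(self) -> float:
        d = self.blink_durations
        return sum(d) / len(d) if d else 0.0
```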
Depth of data
Data is key in deep learning. AllGo Systems gathered data for its deep learning algorithms under different lighting conditions, and using different subjects with different expressions – imaged while yawning, smiling, laughing, etc. The system was also trained to detect the face and eyes when the subject was in different poses – if, for instance, only a portion of the face is visible to the camera.
‘One problem is how do you ground truth the data?’ Sancheti said. ‘If I’ve collected data of a face, how do I tell where you are looking? I don’t know what the angle of your head is from that picture. Manual intervention becomes extremely hard.’
When collecting data for face detection, AllGo Systems made the test subject look in all directions, and asked them to put on different expressions. The person was asked to do a fast blink and a slow blink for the purpose of eye state detection. ‘The hardest problem is that if you want to collect drowsiness data, you cannot ask the person to act sleepy,’ Sancheti explained. ‘We are trying to detect drowsiness based on blink rate and other parameters that use pupil dilation – but no single measure in there is proof of drowsiness.’
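That caveat points towards fusing several weak cues rather than trusting any single one. The sketch below combines blink rate, blink duration and a PERCLOS-style eyelid-closure fraction into one score; the choice of cues, the baselines, weights and threshold are illustrative assumptions, not AllGo’s values.

```python
def drowsiness_score(blinks_per_minute: float,
                     mean_blink_duration_s: float,
                     eyelid_closure_fraction: float) -> float:
    """Fuse weak drowsiness cues into a single score in [0, 1]."""
    # Normalise each cue against a rough 'alert driver' baseline.
    slow_blinks = min(mean_blink_duration_s / 0.5, 1.0)     # long blinks
    frequent = min(blinks_per_minute / 30.0, 1.0)           # elevated rate
    closure = min(eyelid_closure_fraction / 0.15, 1.0)      # PERCLOS-style
    # Weighted vote: eyelid closure dominates, blink metrics corroborate.
    return 0.5 * closure + 0.3 * slow_blinks + 0.2 * frequent

# e.g. frequent, slow blinks with the eyes closed 12% of the time:
if drowsiness_score(25, 0.6, 0.12) > 0.7:
    print("issue drowsiness warning")
```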
For head pose estimation, AllGo Systems used an auxiliary camera trained on markers on the back of the head. So, as the main camera captures an image of the person’s face, the auxiliary camera tracks the markers to show how the head turns.
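The talk did not detail how the tracked markers become angles, but one standard approach is to align the markers’ current 3D positions against a reference capture and read the head rotation off the best-fit alignment (the Kabsch algorithm) – an assumed method, sketched below.

```python
import numpy as np

def head_rotation(ref_pts: np.ndarray, cur_pts: np.ndarray) -> np.ndarray:
    """Best-fit rotation mapping reference marker positions (Nx3) to current."""
    ref_c = ref_pts - ref_pts.mean(axis=0)   # centre both marker clouds
    cur_c = cur_pts - cur_pts.mean(axis=0)
    u, _, vt = np.linalg.svd(ref_c.T @ cur_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))   # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

def yaw_pitch_roll(r: np.ndarray):
    """Decompose a rotation matrix into yaw, pitch, roll (degrees)."""
    yaw = np.degrees(np.arctan2(r[1, 0], r[0, 0]))
    pitch = np.degrees(np.arcsin(-r[2, 0]))
    roll = np.degrees(np.arctan2(r[2, 1], r[2, 2]))
    return yaw, pitch, roll
```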
Collecting data for gaze estimation involved asking people to focus on a dot that was moved around a screen, with the gaze direction recorded irrespective of head pose.
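Given the eye and dot positions in a common coordinate frame – an assumption, since the talk didn’t spell out the geometry – the ground-truth gaze label is simply the eye-to-dot direction expressed as yaw and pitch angles:

```python
import numpy as np

def gaze_angles(eye_pos: np.ndarray, dot_pos: np.ndarray):
    """Yaw/pitch (degrees) of the ray from the eye to the fixation dot.

    Assumed frame: x to the right, y up, z from the eye towards the screen.
    """
    v = dot_pos - eye_pos
    v = v / np.linalg.norm(v)
    yaw = np.degrees(np.arctan2(v[0], v[2]))   # left/right of straight ahead
    pitch = np.degrees(np.arcsin(v[1]))        # up/down
    return yaw, pitch

# e.g. a dot 0.3 m to the right of the eye, level with it, 0.8 m away:
print(gaze_angles(np.zeros(3), np.array([0.3, 0.0, 0.8])))  # ~ (20.6, 0.0)
```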
Once the data had been collected and the ground truth established, training was relatively straightforward, according to Sancheti. The company used GPUs and high-end CPUs to train the system, which took anywhere from a couple of hours to several days, depending on the data size and the complexity of the problem. The company then optimised See ‘n Sense for an embedded Arm platform.
In terms of choosing a neural network, Sancheti advised against picking a model that is overly complex. The developer then has to optimise the model by quantising the weights – not every weight needs to be floating point, Sancheti said. He added that AllGo Systems would run the training cycle multiple times using fixed-point weights to find the most efficient configuration, and that it’s good to consider different weight precisions at training time, before truncating the network. The system was then optimised for the embedded target.
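As a rough illustration of that quantisation step, the sketch below maps float32 weights to signed 8-bit fixed point and measures the round-trip error – one way to judge whether a layer tolerates the reduced precision. The bit width and per-tensor scaling are illustrative choices, not AllGo’s scheme.

```python
import numpy as np

def quantise(weights: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantisation of float weights to signed ints."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8-bit
    scale = float(np.abs(weights).max()) / qmax
    scale = scale if scale > 0 else 1.0           # avoid dividing by zero
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale                               # int8 assumes bits <= 8

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 128).astype(np.float32)  # stand-in weight tensor
q, s = quantise(w)
print(f"max round-trip error: {np.abs(w - dequantise(q, s)).max():.4f}")
```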