Researchers from the Massachusetts Institute of Technology (MIT), in collaboration with the MIT-IBM Watson AI Lab, have introduced a computer vision model that streamlines the processing of high-resolution images.
The model vastly reduces the computational complexity of semantic segmentation – the process of categorising every pixel in high-resolution images, for example when identifying and tracking objects as they move across a scene. It does so while maintaining or even exceeding the accuracy of current state-of-the-art models, which are often very computationally intensive.
It could therefore be deployed effectively on devices with limited hardware resources, such as the on-board computers of autonomous vehicles, where it could support the split-second decisions needed to avoid collisions on the road. It could also improve the efficiency of other high-resolution computer vision tasks, such as medical image segmentation.
How does the computer vision model work?
In recent years, a powerful new type of machine learning model, known as a vision transformer, has been deployed effectively for semantic segmentation.
Transformers were originally developed for natural language processing, where they encode each word in a sentence as a token and then generate something called an ‘attention map’, which captures each token’s relationships with other tokens. The attention map is then used to help the model understand context when making predictions.
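To make this concrete, here is a minimal sketch (an illustration, not the researchers' code) of how an attention map is built over a handful of token embeddings and then used to mix context between them:

```python
import numpy as np

def self_attention(tokens):
    """Minimal self-attention sketch over an (N, d) array of token embeddings.

    Real transformers first project tokens into learned queries, keys and
    values; this sketch reuses the raw embeddings to stay short.
    """
    n, d = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(d)            # (N, N) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)       # for numerical stability
    attn_map = np.exp(scores)
    attn_map /= attn_map.sum(axis=-1, keepdims=True)   # softmax: the attention map
    return attn_map @ tokens                           # each output mixes all tokens

out = self_attention(np.random.randn(6, 16))           # e.g. a 6-token "sentence"
```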
Vision transformers use this concept to chop an image into patches of pixels and then encode each patch into a token before generating an attention map. In generating the map, the model uses a non-linear similarity function that directly learns the interaction between each pair of pixels. In this way, the model develops what is known as a ‘global receptive field’, which allows it to access all the relevant parts of the image.
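For the image side, a small NumPy sketch (again an illustration under an assumed 16-pixel patch size, not the paper's code) of the patch-to-token step that precedes attention:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Chop an (H, W, C) image into non-overlapping patches and flatten each
    patch into one token vector, as a vision transformer does before attention.
    """
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    return (image[:ph * patch_size, :pw * patch_size]
            .reshape(ph, patch_size, pw, patch_size, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(ph * pw, patch_size * patch_size * c))

tokens = patchify(np.random.rand(224, 224, 3))
print(tokens.shape)   # (196, 768): a 224x224 image becomes 196 patch tokens
```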
However, because high-resolution images can contain millions of pixels, chunked into thousands of patches, such attention maps quickly become exceptionally large, and the computation required to make predictions grows quadratically as the image resolution increases. Consequently, while these models are accurate, they are too slow to process high-resolution images in real time on an edge device, such as a sensor or mobile phone.
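A back-of-the-envelope illustration of that quadratic growth, assuming 16×16-pixel patches:

```python
# Dense attention compares every token with every other token, so the map
# has N*N entries, where N is the number of patch tokens.
for side in (512, 1024, 2048):
    n_tokens = (side // 16) ** 2
    print(f"{side}x{side} image -> {n_tokens:>6,} tokens -> {n_tokens**2:>12,} attention entries")
# Doubling the resolution quadruples the tokens and multiplies the map size by 16.
```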
The researchers’ new model, part of a series called EfficientViT, follows a similar albeit simplified concept when constructing attention maps – replacing the non-linear similarity function typically used by transformers with a linear one. This allows the model to rearrange the order of operations, reducing the number of calculations without changing functionality or losing the global receptive field. As a result, the computation needed for a prediction grows linearly (instead of quadratically) as the image resolution grows.
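The general trick can be sketched as follows (a simplified illustration of linear attention in general; the exact similarity function and normalisation EfficientViT uses differ in detail). With a linear similarity, (QKᵀ)V can be regrouped as Q(KᵀV), so the N×N attention map is never materialised:

```python
import numpy as np

def attention_quadratic(q, k, v):
    """Naive grouping: build the full (N, N) map first -- cost grows with N**2."""
    return (q @ k.T) @ v

def attention_linear(q, k, v):
    """Same product, regrouped: k.T @ v is only (d, d), so cost grows with N."""
    return q @ (k.T @ v)

n, d = 4096, 64                          # e.g. a 1024x1024 image with 16x16 patches
q, k, v = (np.random.randn(n, d) for _ in range(3))
assert np.allclose(attention_quadratic(q, k, v), attention_linear(q, k, v))
# Real linear attention also applies a feature map and row normalisation,
# but the regrouping above is what removes the quadratic cost.
```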
“While researchers have been using traditional vision transformers for quite a long time, and they give amazing results, we want people to also pay attention to the efficiency aspect of these models,” said Song Han, an Associate Professor in MIT's Department of Electrical Engineering and Computer Science (EECS) and a Senior Author of the research paper describing the new model. “Our work shows that it is possible to drastically reduce the computation so this real-time image segmentation can happen locally on a device.”
However, Han does acknowledge that the new model's linear attention captures only global context about the image, losing local information and reducing overall accuracy.
The researchers therefore included two extra components in their model, each of which adds only a small amount of computation, to help compensate for the accuracy loss. One element helps the model capture local feature interactions, mitigating the linear function’s weakness in local information extraction. The second, a module that enables multiscale learning, helps the model recognise both large and small objects.
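For illustration only, a rough PyTorch sketch of what such add-on branches might look like; the module names and exact structure below are hypothetical assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LocalAndMultiscaleBranch(nn.Module):
    """Hypothetical illustration of the two kinds of add-on modules described
    above: a cheap depthwise convolution to recover local pixel interactions,
    and pooled branches that mix information at several scales.
    """
    def __init__(self, channels):
        super().__init__()
        # Local branch: a depthwise 3x3 conv touches only nearby pixels, cheaply.
        self.local = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, groups=channels)
        # Multiscale branch: aggregate features over small and large windows.
        self.pool_small = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.pool_large = nn.AvgPool2d(kernel_size=7, stride=1, padding=3)
        self.mix = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):                  # x: (batch, channels, H, W)
        multiscale = torch.cat([x, self.pool_small(x), self.pool_large(x)], dim=1)
        return self.local(x) + self.mix(multiscale)

features = torch.randn(1, 64, 128, 128)        # features from the attention backbone
out = LocalAndMultiscaleBranch(64)(features)   # same shape, richer local/multiscale info
```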
The model was also developed using hardware-friendly architecture, allowing it to run more easily on devices such as virtual reality headsets or edge computers aboard autonomous vehicles.
Consequently, when the researchers tested their finished model on datasets used for semantic segmentation, they found that it performed up to nine times faster on a mobile device than other popular vision transformer models, with the same or even better accuracy.
“Now, we can get the best of both worlds and reduce the computing to make it fast enough that we can run it on mobile and cloud devices,” confirmed Han.
What's next?
The researchers now want to apply their work to speeding up generative machine-learning models, such as those used to create new images.
“Efficient transformer models, pioneered by Professor Song Han’s team, now form the backbone of cutting-edge techniques in diverse computer vision tasks, including detection and segmentation,” said Lu Tian, senior director of AI algorithms at AMD, Inc., who was not involved with the work. “Their research not only showcases the efficiency and capability of transformers, but also reveals their immense potential for real-world applications, such as enhancing image quality in video games.”
Jay Jackson, global vice president of artificial intelligence and machine learning at Oracle, who was also not involved with the work, added: “Model compression and light-weight model design are crucial research topics toward efficient AI computing, especially in the context of large foundation models. Professor Song Han’s group has shown remarkable progress compressing and accelerating modern deep learning models, particularly vision transformers. Oracle Cloud Infrastructure has been supporting his team to advance this line of impactful research toward efficient and green AI.”
Efforts are already underway to further improve the models' accuracy and speed, applying lessons from tests on other datasets. At the same time, the researchers want to scale up EfficientViT for broader vision tasks, promising an efficient and accessible future for high-resolution computer vision across diverse applications.