Deep Learning Transformers Transform AI Vision

New algorithms challenge convolutional neural networks for vision processing.

June 12, 2023

Deep learning algorithms are now being used to improve the accuracy of machine vision. Image: Alexey Yaremenko / iStock / Getty Images Plus

Gordon Cooper, Product Manager, Synopsys Solutions Group

As modern systems and devices such as self-driving cars, mobile phones, and camera-assisted security systems continue to evolve, deep learning models are quickly becoming essential for enhancing the quality and accuracy of machine vision.

For the past decade, convolutional neural networks (CNNs) have dominated the computer vision market. However, transformers, which were initially designed for natural language processing tasks such as translation and question answering, are now emerging as an alternative model for vision. While they likely won’t immediately replace CNNs, transformers are already being used alongside CNNs to improve the accuracy of vision processing applications such as context-aware video inference.

As the most widely used model for vision processing over the past decade, CNNs offer advanced deep learning functionality for classifying images, detecting objects, performing semantic segmentation (labeling every pixel in an image), and more. However, researchers have demonstrated that transformers can beat the accuracy of the latest state-of-the-art CNNs with essentially no vision-specific modifications beyond splitting the input image into small patches.
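To see what splitting an image into patches means in practice, here is a minimal PyTorch sketch of a patch-embedding layer, the step that turns an image into the token sequence a transformer expects. The patch size and embedding width below are illustrative assumptions, not a specific published configuration.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Cut an image into fixed-size patches and project each one to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution patchifies and projects in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768) token sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])

Once the image is a sequence of tokens, the standard transformer machinery applies to it unchanged.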

In 2020, Google research scientists published work on the vision transformer (ViT), an image-classification model based on the original 2017 transformer architecture. The researchers found that ViT “demonstrate[d] excellent performance when trained on sufficient data, outperforming a comparable state-of-the-art CNN with four times fewer computational resources.” While they require training on large data sets, ViTs are now beating CNNs in accuracy.
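For readers who want to experiment, the sketch below loads a pretrained ViT for image classification from torchvision’s model zoo. It assumes torchvision 0.13 or later and is an illustration, not the researchers’ original code.

import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-Base/16 pretrained on ImageNet (assumes torchvision >= 0.13).
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights).eval()
preprocess = weights.transforms()      # resize, crop, and normalize for ViT

img = torch.rand(3, 300, 300)          # stand-in for a real photo
batch = preprocess(img).unsqueeze(0)   # (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)
print(weights.meta["categories"][logits.argmax().item()])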

Differences Between CNNs and Transformers

The primary difference between CNNs and transformers lies in how each model blends information from neighboring pixels and how wide its scope of focus is. A CNN mixes information in a fixed, location-symmetric way: a 3x3 convolution, for example, computes a weighted sum of the nine pixels around each center pixel using the same weights everywhere. Transformers instead use an attention-based mechanism. Attention networks weight relationships according to learned properties of the content rather than location alone, so they can learn and represent more complex relationships. This gives the system an expanding contextual awareness when it attempts to identify an object. For example, a transformer, like a CNN, can discern that the object in the road is a stroller rather than a motorcycle, but rather than expending compute on the less useful pixels of the entire road, a transformer can home in on the most important part of the data.
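The contrast is easy to see in code. The PyTorch sketch below (illustrative, not from any production system) applies both kinds of mixing to the same feature map: the convolution reuses one fixed 3x3 weighted sum at every location, while the attention weights are computed from the content itself.

import torch
import torch.nn.functional as F

B, C, H, W = 1, 8, 14, 14
x = torch.randn(B, C, H, W)

# CNN-style mixing: the same 3x3 weighted sum around every pixel.
conv_w = torch.randn(C, C, 3, 3)           # one weight set, reused everywhere
conv_out = F.conv2d(x, conv_w, padding=1)  # (1, 8, 14, 14)

# Transformer-style mixing: weights depend on the content of the tokens.
tokens = x.flatten(2).transpose(1, 2)      # (1, 196, 8): one token per pixel
q = k = v = tokens
scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (1, 196, 196)
attn = scores.softmax(dim=-1)              # content-dependent weighting
attn_out = attn @ v                        # every token attends to every other

print(conv_out.shape, attn_out.shape)

The 196x196 attention matrix is where the added context, and the added compute, come from: every pixel token can weight every other one.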

Transformers are able to grasp context and absorb more complex patterns when detecting an object. In particular, Swin (shifted window) transformers reach the highest accuracy on object detection (COCO) and semantic segmentation (ADE20K) benchmarks. And while CNNs are usually applied to one still image at a time, with no context from the frames before and after, a transformer can be deployed across video frames and used for action classification.
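A Swin transformer keeps attention affordable by computing it inside small, non-overlapping windows and then shifting the window grid between layers so information can cross window boundaries. The helper below is a simplified sketch of those two steps; the window size and tensor shapes are illustrative, not the published implementation.

import torch

def window_partition(x, win=7):
    """Split a (B, H, W, C) feature map into non-overlapping win x win windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    # -> (num_windows * B, win * win, C): attention runs inside each window.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

x = torch.randn(1, 14, 14, 96)        # toy feature map
windows = window_partition(x)         # (4, 49, 96): 4 windows of 49 tokens

# The next layer shifts the grid by half a window, so tokens near a
# boundary share a window with their former neighbors.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted)
print(windows.shape, shifted_windows.shape)

Restricting attention to 49-token windows keeps the cost roughly linear in image size, which is a large part of why Swin models scale to dense tasks such as detection and segmentation.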

Drawbacks

Currently, designers must take into account that while transformers can achieve higher accuracy, they run at much lower frames-per-second (fps) performance and require many more computations and far more data movement. In the near term, integrating CNNs and transformers will be key to establishing a stronger foundation for future vision processing development. However, even though CNNs remain the mainstream choice for vision processing, deep learning transformers are rapidly advancing and improving on the capabilities of CNNs.
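A back-of-envelope calculation shows why the compute gap exists: self-attention cost grows quadratically with the number of tokens, while convolution cost grows only linearly with the number of pixels. The figures below are rough illustrative estimates, not vendor benchmarks.

# Rough multiply-accumulate (MAC) counts for one layer on a 224x224 input.
H = W = 224
C = 64                              # channel / embedding width (illustrative)

# 3x3 convolution: cost is linear in the number of pixels.
conv_macs = H * W * C * C * 3 * 3

# Global self-attention with one token per pixel: cost is quadratic in tokens.
tokens = H * W
attn_macs = 2 * tokens ** 2 * C     # QK^T plus the attention-weighted sum of V

print(f"3x3 conv:         {conv_macs:,} MACs")      # ~1.8 billion
print(f"global attention: {attn_macs:,} MACs")      # ~322 billion

This quadratic growth is exactly why practical vision transformers fall back on patch tokens or windowed attention.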

As research continues, it may not take long for transformers to completely replace CNNs in real-time vision processing applications. Their amplified contextual awareness of complex patterns, along with their higher accuracy, will be beneficial for future AI applications.
