Facebook announced DINO, an algorithm developed in collaboration with Inria that enables the training of transformers, a type of machine learning model, without labeled training data. The company claims DINO sets a new state of the art among methods that train on unlabeled data and yields a model that can discover and segment objects in an image or video without ever being trained on a segmentation objective.
Segmenting objects is used in tasks ranging from swapping out the background of a video chat to teaching robots to navigate a factory. But it’s considered among the hardest challenges in computer vision because it requires an AI to truly understand what’s in an image.
Transformers enable AI models to selectively focus on parts of their input, allowing them to reason more effectively. While initially applied to speech and natural language processing, transformers have since been adopted for computer vision problems such as image classification and object detection.
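That selective focus comes from an attention mechanism. As a minimal sketch (in PyTorch, with illustrative names and toy dimensions, not code from either project), scaled dot-product self-attention re-expresses each token of the input, such as an image patch in a vision transformer, as a weighted mix of all the tokens:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention over a token sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values
    scores = q @ k.T / (k.shape[-1] ** 0.5)    # pairwise relevance, scaled
    weights = F.softmax(scores, dim=-1)        # how strongly each token attends to each other token
    return weights @ v                         # each output is a weighted mix of the values

# Toy usage: 4 tokens (e.g. image patches) of dimension 8.
dim = 8
x = torch.randn(4, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```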
DINO works by matching the output of a model over different views of the same image. In doing so, it can effectively discover object parts and shared characteristics across images. Moreover, DINO can connect categories based on visual properties, for example, cleanly separating animal species in a structure that resembles the biological taxonomy.
[Image caption: Facebook’s DINO system can segment images in an unsupervised fashion.]
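To make that matching concrete, here is a hedged sketch of the self-distillation recipe DINO builds on, with toy linear networks standing in for real backbones: a student sees one augmented view, a gradient-free teacher sees another, the student is trained to match the teacher’s centered, sharpened output, and the teacher is then updated as an exponential moving average of the student:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, out_dim = 16, 32
student = torch.nn.Linear(dim, out_dim)      # stand-in for a real backbone + projection head
teacher = torch.nn.Linear(dim, out_dim)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)                  # the teacher receives no gradients

center = torch.zeros(out_dim)                # running center keeps outputs from collapsing

def dino_loss(s_out, t_out, tau_s=0.1, tau_t=0.04):
    # teacher output is centered and sharpened, then used as the target distribution
    t = F.softmax((t_out - center) / tau_t, dim=-1)
    s = F.log_softmax(s_out / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()       # cross-entropy between the two views

# two random perturbations of the same batch stand in for augmented crops
x = torch.randn(8, dim)
view1 = x + 0.1 * torch.randn_like(x)
view2 = x + 0.1 * torch.randn_like(x)

loss = dino_loss(student(view1), teacher(view2))
loss.backward()                              # (an optimizer step would go here)

with torch.no_grad():                        # the teacher tracks the student via EMA
    m = 0.996
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_((1 - m) * ps)
    center = 0.9 * center + 0.1 * teacher(view2).mean(dim=0)
```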
Facebook claims that DINO is also among the best at identifying copies of an image, even though it wasn’t designed for that task. That suggests DINO-based models could one day be used to help identify misinformation or copyright infringement.
Facebook also today detailed a new machine learning approach called PAWS that ostensibly achieves better classification accuracy than previous state-of-the-art semi-supervised methods. Notably, it also requires an order of magnitude (4 to 12 times) less training, making PAWS a potential fit for domains where there aren’t many labeled images, like medicine.
Residing between supervised and unsupervised learning, semi-supervised learning works with data that is only partially labeled, often with the majority of examples lacking labels. The ability to learn from limited labeled data is a key benefit, because data scientists spend the bulk of their time cleaning and organizing data.
PAWS achieves its results by leveraging a portion of labeled data in conjunction with unlabeled data. Given an unlabeled training image, PAWS generates two or more views of the image using random data augmentations and transformations. It then trains a model to make the representations of these views similar to one another.
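As an illustrative sketch of that recipe (simplified relative to the released implementation; the encoder and hyperparameters here are placeholders), PAWS-style training pseudo-labels each view with a soft nearest-neighbor vote over a small labeled support set, then trains one view’s assignment to match a sharpened version of the other’s:

```python
import torch
import torch.nn.functional as F

def snn(z, support_z, support_y, tau=0.1):
    # soft nearest-neighbor: each embedding gets a class distribution that is
    # a similarity-weighted mix of the labels of the support samples
    z = F.normalize(z, dim=-1)
    support_z = F.normalize(support_z, dim=-1)
    weights = F.softmax(z @ support_z.T / tau, dim=-1)
    return weights @ support_y

torch.manual_seed(0)
in_dim, emb_dim, n_classes = 8, 16, 4
encoder = torch.nn.Linear(in_dim, emb_dim)   # stand-in for a real backbone

# the small labeled portion: a support set with one-hot labels
support_x = torch.randn(20, in_dim)
support_y = F.one_hot(torch.randint(0, n_classes, (20,)), n_classes).float()

# two random perturbations of the same unlabeled batch stand in for augmented views
x = torch.randn(8, in_dim)
view1 = x + 0.1 * torch.randn_like(x)
view2 = x + 0.1 * torch.randn_like(x)

support_z = encoder(support_x)
p1 = snn(encoder(view1), support_z, support_y)
with torch.no_grad():                        # the target assignment gets no gradient
    p2 = snn(encoder(view2), support_z, support_y)
    p2 = p2 ** 4                             # sharpening discourages collapse to uniform
    p2 = p2 / p2.sum(dim=-1, keepdim=True)

# cross-entropy: view 1's assignment should match view 2's sharpened assignment
loss = -(p2 * p1.clamp_min(1e-8).log()).sum(dim=-1).mean()
loss.backward()
```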
“With DINO and PAWS, the AI research community can build new computer vision systems that are far less dependent on labeled data and vast computing resources for training,” the Facebook statement continued. “We hope that our experiments will show the community the potential of self-supervised systems trained on [visual transformers] and encourage further adoption.”
Both DINO and PAWS are available in open source.