DeepMind Trains 80 Billion Parameter AI Vision-Language Model Flamingo

By Anthony Alford

Article originally posted on InfoQ.

DeepMind recently trained Flamingo, an 80B parameter vision-language model (VLM) AI. Flamingo combines separately pre-trained vision and language models and outperforms all other few-shot learning models on 16 vision-language benchmarks. Flamingo can also chat with users, answering questions about input images and videos.

The model was announced in a blog post by lead researchers Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, and Antoine Miech. Flamingo is based on two previous models developed by DeepMind: Chinchilla, a 70B parameter language generation model; and Perceiver, a multimodal classifier model. Flamingo combines these two models into a single neural network, which is then trained on sequences of interleaved image and text data. The result is an AI that can learn new vision-language tasks with little or no additional training data. According to Alayrac et al.:

Models like Flamingo hold great promise to benefit society in practical ways and we’re continuing to improve their flexibility and capabilities so they can be safely deployed for everyone’s benefit. Flamingo’s abilities pave the way towards rich interactions with learned visual language models that can enable better interpretability and exciting new applications, like a visual assistant which helps people in everyday life—and we’re delighted by the results so far.

Multimodal VLMs, such as CLIP, have proven successful at zero-shot learning; however, because such models only provide a score indicating the similarity between an image and a textual description, their range of tasks is limited. Other VLMs, such as DALL-E, can generate photo-realistic images from a description, but do not generate language, and so cannot perform tasks such as visual question answering (VQA) or image captioning.
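For context, CLIP-style zero-shot learning works by scoring a single image against a set of candidate captions, which is why such models can rank descriptions but cannot generate answers or captions of their own. The sketch below illustrates that scoring pattern using the Hugging Face transformers CLIP wrapper; the checkpoint name, image file, and candidate labels are illustrative assumptions, not details from the article.

```python
# Minimal sketch of similarity-based zero-shot classification with a
# CLIP-like model: every task must be recast as ranking candidate captions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image
candidates = ["a photo of a flamingo", "a photo of a penguin", "a photo of a parrot"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image  # image-text similarity scores
print(logits_per_image.softmax(dim=-1))  # probabilities over the candidate captions
```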

Because large generative language models such as GPT-3 have shown strong few-shot performance on a wide range of natural language processing (NLP) tasks, the DeepMind team chose to build on their Chinchilla language model, which outperforms GPT-3 on many of these tasks. This required several changes to Chinchilla. The first was handling multimodal data without degrading the model's language abilities. To solve this, the team interleaved new cross-attention layers with the existing self-attention layers, which were kept frozen during training.
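Concretely, each new cross-attention block is gated with a tanh gate initialized to zero, so at the start of training the combined network behaves exactly like the frozen language model. Below is a minimal PyTorch sketch of that arrangement; the class names, dimensions, and layer structure are illustrative assumptions rather than DeepMind's implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Trainable block that lets text tokens attend to visual tokens.
    The tanh gates start at zero, so initially the frozen LM is unchanged."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        attn_out, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        return x + torch.tanh(self.ffw_gate) * self.ffw(x)

class InterleavedLayer(nn.Module):
    """Pairs one new, trainable cross-attention block with one frozen LM layer."""
    def __init__(self, frozen_lm_layer, dim):
        super().__init__()
        self.xattn = GatedCrossAttentionBlock(dim)
        self.lm_layer = frozen_lm_layer
        for p in self.lm_layer.parameters():
            p.requires_grad = False  # pre-trained language-model weights stay frozen

    def forward(self, text_tokens, visual_tokens):
        return self.lm_layer(self.xattn(text_tokens, visual_tokens))
```

Only the new cross-attention blocks receive gradient updates in this scheme; the pre-trained language-model parameters never change, which is what preserves the original language abilities.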

To support both single-frame images and video, the researchers incorporated a Perceiver model that generates a “small fixed number of visual tokens” for both images and videos, improving the model’s scalability with input size. Finally, the team needed a large combined image-text training dataset. For this, they scraped text and images from about 43 million web pages to create the MultiModal MassiveWeb (M3W) dataset, which contains 185M images and 182 GB of text. Flamingo was trained on a combination of M3W and several other pre-existing image-text datasets.
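The Perceiver-based component acts as a resampler: a fixed set of learned latent queries cross-attends to however many visual features an image or video produces, so the downstream layers always see the same number of visual tokens. A rough sketch of that idea follows; the latent count, depth, and dimensions are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Maps a variable-length set of visual features to a fixed number of
    visual tokens via learned latent queries (illustrative sketch only)."""
    def __init__(self, dim, num_latents=64, num_heads=8, depth=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(depth)]
        )

    def forward(self, visual_features):
        # visual_features: (batch, any number of patch/frame features, dim)
        latents = self.latents.unsqueeze(0).expand(visual_features.shape[0], -1, -1)
        for attn in self.layers:
            out, _ = attn(latents, visual_features, visual_features)
            latents = latents + out
        return latents  # (batch, num_latents, dim): a fixed-size token set

# A 10-frame video clip and a single image both reduce to 64 visual tokens:
resampler = PerceiverResampler(dim=1024)
video = torch.randn(1, 10 * 256, 1024)  # e.g. 10 frames x 256 patches each
image = torch.randn(1, 256, 1024)       # a single frame
print(resampler(video).shape, resampler(image).shape)  # both: (1, 64, 1024)
```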

To evaluate Flamingo, DeepMind tested it on 16 multimodal benchmarks covering a range of tasks including visual dialogue, VQA, captioning, and image classification. In few-shot learning scenarios, Flamingo outperformed the previous best results “by a large margin.” On six of the benchmarks, Flamingo outperformed state-of-the-art fine-tuned models without itself being fine-tuned; instead, Flamingo was used in a few-shot scenario and given only 32 samples, “around 1000 times less” task-specific training data than the fine-tuned models required.
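In practice, “few-shot” here means the support examples are simply interleaved with the query in a single multimodal prompt, and the model continues the text. The snippet below is a purely illustrative sketch of that prompt structure; the file names, the end-of-chunk marker, and the dictionary format are hypothetical and not Flamingo’s actual input encoding.

```python
# Hypothetical few-shot prompt assembly: (image, text) support pairs are
# interleaved ahead of the query image, and the model generates the
# continuation for the final, empty text slot.
support_examples = [
    ("cat.jpg", "A cat lounging on a sunny windowsill."),
    ("dog.jpg", "A dog catching a frisbee in the park."),
]
query_image = "bird.jpg"

prompt = []
for image_path, caption in support_examples:
    prompt.append({"image": image_path})        # placeholder for visual tokens
    prompt.append({"text": caption + "<EOC>"})  # end-of-chunk marker (illustrative)
prompt.append({"image": query_image})
prompt.append({"text": ""})  # the model completes this caption

print(prompt)
```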

In a Reddit discussion about Flamingo, one user noted:

Any work that can reduce the required training data, and that can generalize understanding, is going to be incredibly relevant. There are so many different advances that these companies are trying to combine to create generalized AI, it’s amazing to see. I imagine we’ll see much more research on catastrophic forgetting this year too.

Multimodal AI is an active research topic. Earlier this year, InfoQ covered Data2vec, a multimodal AI from Meta which can perform a variety of computer vision and speech recognition tasks. Last year, InfoQ covered DeepMind’s Perceiver, and more recently covered DeepMind’s new generalist AI model Gato, which can perform “more than 600 different tasks” including image captioning and robot control.
