Google's Image-Text AI LIMoE Outperforms CLIP on ImageNet Benchmark

Anthony Alford

Article originally posted on InfoQ.

Researchers at Google Brain recently trained Language-Image Mixture of Experts (LIMoE), a 5.6B-parameter image-text AI model. In zero-shot learning experiments on ImageNet, LIMoE outperforms CLIP and performs comparably to state-of-the-art models while using fewer compute resources.

The model and several experiments were described in a paper published on arXiv. LIMoE combines a sparse mixture-of-experts (MoE) scheme with the Transformer architecture, which allows the number of model parameters to grow while keeping computational requirements low during inference. Unlike CLIP and other “two-tower” image-text models that use separate encoder networks for images and text, LIMoE has a single encoder for both modalities, which has the potential for better scalability and generality. According to the Google Brain team:

Multimodal models that handle many tasks are a promising route forward, and there are two key ingredients for success: scale, and the ability to avoid interference between distinct tasks and modalities while taking advantage of synergies. Sparse conditional computation is an excellent way of doing both. It enables performant and efficient generalist models that also have the capacity and flexibility for the specialization necessary to excel at individual tasks, as demonstrated by LIMoE’s solid performance with less compute.

The development of LIMoE is part of Google’s Pathways strategy for developing next-generation AI models. One tenet of this effort is the use of sparse neural network models, wherein only a few of the pathways through the network are activated for any given input. This means that using the model for inference requires a fraction of the compute resources (and thus the energy) used by a dense model of comparable size. InfoQ recently reported on Google’s PaLM language model, also developed as part of the Pathways project. In 2021, InfoQ reported on Google’s Switch Transformer, a sparse MoE language model that pre-dated the official Pathways announcement but is designed using some of its principles.
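
To make the "fraction of the compute" point concrete, the following back-of-the-envelope sketch counts the parameters in one mixture-of-experts layer and the subset that a single token actually touches. The layer sizes, the expert count, and the helper names `ffn_params` and `moe_layer_params` are all hypothetical and chosen only for illustration; they are not LIMoE's real configuration.

```python
def ffn_params(d_model, d_hidden):
    """Parameters in one feed-forward block: two weight matrices plus biases."""
    return d_model * d_hidden + d_hidden + d_hidden * d_model + d_model


def moe_layer_params(d_model, d_hidden, num_experts, top_k):
    """Return (total, active_per_token) parameter counts for one sparse layer."""
    per_expert = ffn_params(d_model, d_hidden)
    router = d_model * num_experts           # one score per expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router     # only top_k experts run per token
    return total, active


# Hypothetical sizes, NOT taken from the LIMoE paper.
total, active = moe_layer_params(d_model=1024, d_hidden=4096,
                                 num_experts=32, top_k=1)
print(f"total: {total:,}  active per token: {active:,}  "
      f"fraction: {active / total:.3f}")
```

Under these assumed sizes, adding experts grows the total parameter count almost linearly while the per-token cost stays roughly that of a single feed-forward block.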

LIMoE is based on the Transformer architecture, in which the sequence of input tokens is processed by a series of identical blocks, each containing several neural network layers, including an attention layer and a simple feed-forward layer. In LIMoE, the feed-forward layer is replaced by an expert layer, which contains parallel feed-forward layers called experts and a router that determines which experts handle a given token.
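
The expert layer described above can be sketched in a few lines of plain Python. This is a toy illustration, not the LIMoE implementation: the class name `ExpertLayer`, the softmax router, the top-k selection, and the single-matrix "experts" are all assumptions made to keep the example small.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class ExpertLayer:
    """Toy sparse expert layer: a router plus parallel feed-forward experts."""

    def __init__(self, dim, num_experts, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one weight vector per expert, used to score each token.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each "expert" is a single dim x dim linear map followed by ReLU,
        # standing in for a full feed-forward block (an assumption).
        self.experts = [
            [[rng.gauss(0, 1) / math.sqrt(dim) for _ in range(dim)]
             for _ in range(dim)]
            for _ in range(num_experts)
        ]

    def route(self, token):
        """Return (expert_index, gate_weight) pairs for the top_k experts."""
        probs = softmax([dot(w, token) for w in self.router])
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        return [(i, probs[i]) for i in ranked[: self.top_k]]

    def forward(self, token):
        """Run only the selected experts and sum their gated outputs."""
        out = [0.0] * len(token)
        for idx, gate in self.route(token):
            hidden = [max(0.0, dot(row, token)) for row in self.experts[idx]]
            out = [o + gate * h for o, h in zip(out, hidden)]
        return out


layer = ExpertLayer(dim=4, num_experts=8, top_k=1)
tokens = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.9, 0.1, -0.3]]
for t in tokens:
    chosen = layer.route(t)
    print("experts used:", [i for i, _ in chosen], "output:", layer.forward(t))
```

Only the experts returned by `route` are evaluated for each token, which is where the inference savings of a sparse model come from.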

The Brain team encountered several challenges in training this model. One challenge, common to all MoE models, is to make sure that the model does not collapse; that is, that the router does not always choose the same expert. Another challenge, specific to multimodal data, is “modality unbalance”; for example, the dataset may contain much more text than image data. In this case, model collapse can occur for the smaller modality. To address these challenges, the team introduced two new training losses: local entropy, which “encourages concentrated router weights,” and global entropy, which results in “diverse expert usage.”
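
The article gives no formulas for these two losses, so the following is only one plausible reading, offered as a sketch: the local loss is taken to be the average entropy of each token's routing distribution (lower means more concentrated weights), and the global loss is taken to be the negative entropy of the routing distribution averaged over the batch (lower means more evenly used experts). The function names and the example routing tables are hypothetical, and details such as per-modality application or weighting are not modelled.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def local_entropy_loss(router_probs):
    """Mean per-token entropy; minimizing it concentrates each token's weights."""
    return sum(entropy(p) for p in router_probs) / len(router_probs)


def global_entropy_loss(router_probs):
    """Negative entropy of the batch-averaged routing distribution;
    minimizing it spreads usage across experts."""
    n = len(router_probs)
    num_experts = len(router_probs[0])
    mean = [sum(p[e] for p in router_probs) / n for e in range(num_experts)]
    return -entropy(mean)


# Hypothetical routing probabilities for 4 tokens over 4 experts.
collapsed = [[0.97, 0.01, 0.01, 0.01]] * 4           # every token picks expert 0
balanced = [[0.97, 0.01, 0.01, 0.01],
            [0.01, 0.97, 0.01, 0.01],
            [0.01, 0.01, 0.97, 0.01],
            [0.01, 0.01, 0.01, 0.97]]                # confident and diverse
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 4           # no token is confident

for name, probs in [("collapsed", collapsed), ("balanced", balanced),
                    ("uncertain", uncertain)]:
    print(name, "local:", round(local_entropy_loss(probs), 3),
          "global:", round(global_entropy_loss(probs), 3))
```

Under this reading, the collapsed router is penalized by the global term, the uncertain router is penalized by the local term, and only the balanced router scores well on both.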

Lead author Basil Mustafa posted a Twitter thread about the work and answered user questions. When one user asked him to elaborate on how the network was “finicky” to train, Mustafa replied:

LIMoE setup is stable in that, given good [hyper-parameters], it’s [very] reliably reproducible & good. However [in my opinion] still a bit sensitive to these [hyper-parameters]; sometimes [the] recipe fails and needs trial and error tuning.

Google has not released the LIMoE model code, but Mustafa suggested the code would be available on GitHub along with a sparse MoE model for vision within “a few months.”
