MMS • Daniel Dominguez
Article originally posted on InfoQ.
Google AI has released a research paper on Muse, a text-to-image generation model based on masked generative transformers. Muse produces images of a quality comparable to those of rival models such as DALL-E 2 and Imagen, at a far faster rate.
Muse is trained to predict randomly masked image tokens, conditioned on the text embedding from a pre-trained large language model. The training task is masked modeling in discrete token space. Instead of pixel-space diffusion or an autoregressive model, Muse uses a 900 million parameter masked generative transformer to create images.
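The masked-token objective described above can be illustrated with a minimal sketch. It is not Muse's implementation: the sentinel `MASK_ID`, the helper names, and the dictionary-based probability format are all hypothetical, and a real model would also condition on the text embedding. The sketch only shows the core idea of hiding a random subset of discrete image tokens and scoring predictions at the hidden positions alone.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel standing in for a special [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens; return the masked copy and the hidden positions."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


def masked_cross_entropy(probs, targets, positions):
    """Average negative log-likelihood, computed over the masked positions only.

    probs[i] maps each token id to the probability predicted at position i.
    """
    total = sum(math.log(probs[p][targets[p]]) for p in positions)
    return -total / len(positions)


# Toy usage: 16 image tokens from a vocabulary of 8, scored by a uniform "model".
rng = random.Random(0)
vocab_size = 8
image_tokens = [rng.randrange(vocab_size) for _ in range(16)]
masked_tokens, hidden = mask_tokens(image_tokens, mask_ratio=0.5, rng=rng)
uniform = [{t: 1.0 / vocab_size for t in range(vocab_size)} for _ in image_tokens]
loss = masked_cross_entropy(uniform, image_tokens, hidden)
```

With a uniform predictor the loss equals the logarithm of the vocabulary size, which is the baseline a trained model would need to beat.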
Google claims that on a TPUv4 chip, Muse can create a 256 by 256 image in as little as 0.5 seconds, compared with 9.1 seconds for Imagen, its diffusion model, which the company says offers an “unprecedented degree of photorealism” and a “deep level of language understanding.” TPUs, or Tensor Processing Units, are custom chips developed by Google as dedicated AI accelerators.
According to the research, Google AI has trained a series of Muse models with varying sizes, ranging from 632 million to 3 billion parameters, finding that conditioning on a pre-trained large language model is crucial for generating photorealistic, high-quality images.
Because it uses parallel decoding, Muse is also more efficient than Parti, a state-of-the-art autoregressive model. At inference time it is more than 10 times faster than the Imagen-3B or Parti-3B models and three times faster than Stable Diffusion v1.4, based on tests on equivalent hardware.
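To see why parallel decoding reduces inference cost, consider the following rough sketch. The article does not describe the decoding procedure in detail, so the cosine masking schedule and the confidence-based commit rule here are assumptions in the style of masked generative transformers, and `make_toy_predictor` is a hypothetical stand-in for the real model. The point is only that every masked position is predicted in one forward pass, so the number of passes is set by the step count rather than by the number of tokens.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel for a not-yet-decoded token


def cosine_mask_count(step, total_steps, seq_len):
    """Number of tokens left masked after `step` of `total_steps` (assumed cosine schedule)."""
    frac = math.cos(math.pi / 2 * step / total_steps)
    return math.floor(frac * seq_len)


def parallel_decode(predict, seq_len, total_steps):
    """Start fully masked; each step predicts all positions at once and commits the most confident."""
    tokens = [MASK_ID] * seq_len
    for step in range(1, total_steps + 1):
        preds = predict(tokens)  # one forward pass: (token, confidence) per position
        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        n_commit = len(masked) - cosine_mask_count(step, total_steps, seq_len)
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def make_toy_predictor(vocab_size, rng):
    """Hypothetical model stand-in: random tokens and confidences, with a call counter."""
    calls = {"n": 0}

    def predict(tokens):
        calls["n"] += 1
        return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

    return predict, calls


# Toy usage: decode 256 tokens in 8 forward passes.
predict, calls = make_toy_predictor(vocab_size=8, rng=random.Random(0))
decoded = parallel_decode(predict, seq_len=256, total_steps=8)
```

In this toy setup, 256 tokens are filled in with 8 calls to the predictor, whereas a token-by-token autoregressive decoder would need one call per token.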
Muse creates images that reflect the different parts of speech found in the input captions, such as nouns, verbs, and adjectives. It also shows an understanding of visual style and of multi-object properties such as compositionality and cardinality.
Generative image models have come a long way in recent years, thanks to novel training methods and improved deep learning architectures. These models have the ability to generate highly detailed and realistic images, and they’re becoming increasingly powerful tools for a wide range of industries and applications.