Anthony Alford
Article originally posted on InfoQ.
Meta AI recently open-sourced NLLB-200, an AI model that can translate between any of over 200 languages. NLLB-200 is a 54.5B-parameter Mixture of Experts (MoE) model that was trained on a dataset containing more than 18 billion sentence pairs. On benchmark evaluations, NLLB-200 outperforms previous state-of-the-art models by an average of 44%.
The model was developed as part of Meta’s No Language Left Behind (NLLB) project, which focuses on providing machine translation (MT) support for low-resource languages: those with fewer than one million publicly available translated sentences. To develop NLLB-200, the researchers collected several multilingual training datasets, both by hiring professional human translators and by mining data from the web. The team also created and open-sourced an expanded benchmark dataset, FLORES-200, which can evaluate MT models in over 40k translation directions (200 languages yield 200 × 199 = 39,800 source-target combinations). According to Meta,
Translation is one of the most exciting areas in AI because of its impact on people’s everyday lives. NLLB is about much more than just giving people better access to content on the web. It will make it easier for people to contribute and share information across languages. We have more work ahead, but we are energized by our recent progress….
Meta AI researchers have been working on the problems of neural machine translation (NMT) and low-resource languages for many years. In 2018, Meta released Language-Agnostic SEntence Representations (LASER), a library for converting text to an embedding space that preserves sentence meaning across 50 languages. 2019 saw the release of the first iteration of the FLORES evaluation dataset, which was expanded to 100 languages in 2021. In 2020, InfoQ covered the release of Meta’s M2M-100, the first single model that could translate directly between any pair of 100 languages.
As part of the latest release, the FLORES benchmark was updated to cover 200 languages. The researchers hired professional translators to translate the FLORES sentences into each new language, with an independent set of translators reviewing the work. Overall, the benchmark contains translations of 3k sentences sampled from the English version of Wikipedia.
To train the NLLB-200 model, Meta created several multilingual datasets. NLLB-MD, a dataset for evaluating how well the model generalizes, contains 3k sentences from four non-Wikipedia sources, also professionally translated into six low-resource languages. NLLB-Seed contains 6k sentences from Wikipedia professionally translated into 39 low-resource languages and is used for “bootstrapping” model training. Finally, the researchers built a data-mining pipeline that generated a multilingual training dataset of over 1B sentence pairs covering 148 languages.
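The mining step builds on multilingual sentence embeddings in the spirit of the LASER work mentioned above: candidate sentence pairs are found by comparing embeddings of sentences drawn from monolingual web corpora. The snippet below is a minimal sketch of margin-based bitext mining under that assumption; the embeddings, the neighborhood size k, and the threshold are illustrative, and the production pipeline adds language identification, filtering, and large-scale nearest-neighbor indexing.

```python
# Minimal sketch of embedding-based bitext mining (ratio-margin scoring).
# Assumes L2-normalized multilingual sentence embeddings are already available;
# k and threshold are illustrative values, not the NLLB pipeline's settings.
import numpy as np

def mine_pairs(src_emb, tgt_emb, k=4, threshold=1.06):
    """Return (src_idx, tgt_idx, score) candidates whose margin score exceeds threshold.

    src_emb: (n, d) array of source-language sentence embeddings.
    tgt_emb: (m, d) array of target-language sentence embeddings.
    """
    sims = src_emb @ tgt_emb.T                       # cosine similarities (embeddings are normalized)
    # Average similarity of each sentence to its k nearest neighbors in the other language
    src_knn = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    tgt_knn = np.sort(sims, axis=0)[-k:, :].mean(axis=0)
    pairs = []
    for i in range(sims.shape[0]):
        j = int(sims[i].argmax())                    # best target candidate for source sentence i
        margin = sims[i, j] / ((src_knn[i] + tgt_knn[j]) / 2)
        if margin >= threshold:
            pairs.append((i, j, float(margin)))
    return pairs
```

The margin criterion normalizes each candidate's similarity by the average similarity of its nearest neighbors, which helps reject sentences that are superficially similar to many others rather than true translations.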
The final NLLB-200 model is based on the Transformer encoder-decoder architecture, but every fourth Transformer block has its feed-forward layer replaced with a Sparsely Gated Mixture of Experts layer. To compare the model with the existing state of the art, the team evaluated it on the older FLORES-101 benchmark, where NLLB-200 outperformed other models by 7.3 sentence-piece BLEU (spBLEU) points on average: a 44% improvement.
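The snippet below is a minimal PyTorch sketch of that idea, not the NLLB-200 implementation (which uses fairseq's MoE layers with capacity limits and load-balancing losses). The hidden sizes, expert count, and top-2 routing shown here are illustrative assumptions.

```python
# Sketch of a sparsely gated MoE feed-forward layer replacing a dense one.
# Dimensions, expert count, and top-2 gating are illustrative, not NLLB-200's exact config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFeedForward(nn.Module):
    """Each token is routed to its top-2 experts instead of a single dense feed-forward layer."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)       # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)          # routing probabilities per token
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                       # dispatch each token to its chosen experts
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

# In the encoder/decoder stack, only every fourth block would use the MoE layer;
# the rest keep an ordinary dense feed-forward layer.
blocks = [SparseMoEFeedForward() if (i + 1) % 4 == 0 else
          nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
          for i in range(12)]
```

The appeal of the sparse design is that only a small fraction of the 54.5B parameters is active for any given token, so model capacity grows without a proportional increase in per-token compute.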
Several members of the NLLB team joined a Reddit “Ask Me Anything” session to answer questions about the work. When one user asked about challenges posed by low-resource languages, research scientist Philipp Koehn replied:
Our main push was towards languages that were not served by machine translation before. We tend to have less pre-existing translated texts or even any texts for them – which is a problem for our data-driven machine learning methods. Different scripts are a problem, especially for translating names. But there are also languages that express less information explicitly (such as tense or gender), so translating from those languages requires inference over a broader context.
The NLLB-200 models and training code, as well as the FLORES-200 benchmark, are available on GitHub.
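For readers who want to experiment, translation checkpoints are also published on the Hugging Face Hub. The sketch below assumes the distilled 600M-parameter variant (facebook/nllb-200-distilled-600M) and the Hugging Face transformers API, translating an English sentence into French; see the GitHub repository for the full model and the fairseq training code.

```python
# Minimal usage sketch, assuming the distilled 600M checkpoint on the Hugging Face Hub.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Translation is one of the most exciting areas in AI.", return_tensors="pt")
# Force the decoder to start with the target-language tag (French, Latin script)
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```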