Google AI Updates Universal Speech Model to Scale Automatic Speech Recognition Beyond 100 Languages
MMS • Daniel Dominguez
Article originally posted on InfoQ.
Google AI has recently unveiled an update to its Universal Speech Model (USM) in support of the 1,000 Languages Initiative. According to Google, the new model outperforms OpenAI's Whisper across all segments of automatic speech recognition.
A universal speech model is a machine learning model trained to recognize and understand spoken language across different languages and accents. USM is a family of state-of-the-art speech models with 2B parameters, trained on 12 million hours of speech and 28 billion sentences of text spanning 300+ languages. According to Google, USM can perform automatic speech recognition (ASR) not only on widely spoken languages like English and Mandarin, but also on under-resourced languages such as Amharic, Cebuano, Assamese, and Azerbaijani.
The initial phase of the training process involves unsupervised learning on speech audio from a vast array of languages. Subsequently, the model's quality and language coverage can be improved through an optional pre-training stage that employs text data; the decision to include this stage depends on the availability of text data, and when it is incorporated, USM achieves superior performance. In the final stage of the pipeline, the model is fine-tuned on downstream tasks such as automatic speech recognition or automatic speech translation using only a small amount of supervised data.
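The staged recipe described above can be sketched in miniature. The code below is purely illustrative: it uses NumPy only, linear least-squares models stand in for USM's neural encoder and decoder, a masked-frame prediction objective stands in for the unsupervised stage, and the optional text-injection stage is omitted. All dimensions, data, and function names are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: self-supervised pre-training on unlabeled "audio" ------------
# Predict each frame from its neighbors (a heavily simplified masked-frame
# objective); the learned projection W acts as our toy "encoder".
T, d = 2000, 8
noise = rng.normal(size=(T, d))
unlabeled = np.empty((T, d))
unlabeled[0] = noise[0]
for t in range(1, T):  # give the data some temporal structure
    unlabeled[t] = 0.8 * unlabeled[t - 1] + 0.2 * noise[t]

context = np.hstack([unlabeled[:-2], unlabeled[2:]])  # frames t-1 and t+1
masked = unlabeled[1:-1]                              # frame t to predict
W, *_ = np.linalg.lstsq(context, masked, rcond=None)  # shape (2d, d)

def encode(frames):
    """Toy encoder: project neighbor context through pre-trained W."""
    ctx = np.hstack([frames[:-2], frames[2:]])
    return ctx @ W

# --- Stage 3: fine-tune on a small labeled set -----------------------------
# A tiny linear head on frozen encoder features stands in for the ASR head,
# trained on only 40 labeled "utterances".
def make_utterance(label):
    mean = 1.0 if label == 1 else -1.0
    return mean + 0.3 * rng.normal(size=(12, d))

labels = rng.integers(0, 2, size=40)
feats = np.stack([encode(make_utterance(y)).mean(axis=0) for y in labels])
X = np.hstack([feats, np.ones((len(labels), 1))])     # add a bias column
head, *_ = np.linalg.lstsq(X, labels * 2.0 - 1.0, rcond=None)
preds = (X @ head > 0).astype(int)
acc = (preds == labels).mean()
print(f"fine-tune accuracy on {len(labels)} labeled utterances: {acc:.2f}")
```

The point of the sketch is the data flow, not the models: the encoder is shaped entirely by cheap unlabeled data, so the final supervised stage needs only a small labeled set, which mirrors why the approach helps under-resourced languages.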
According to the research, the two significant challenges in automatic speech recognition are scalability and computational efficiency. Traditional supervised learning methods do not scale because collecting enough data to build high-quality models is difficult, especially for languages with little or no representation.
In contrast, self-supervised learning is a better method for scaling ASR across numerous languages, since it can make use of far more accessible audio-only data. For ASR models to improve in a computationally efficient manner while increasing language coverage and quality, a flexible, efficient, and generalizable learning algorithm is required: one that can handle large amounts of data from various sources and generalize to new languages and use cases without complete retraining.
The model's encoder was pre-trained on a large unlabeled multilingual dataset and then fine-tuned on a smaller collection of labeled data, which makes it possible to recognize under-represented languages. Furthermore, the training procedure adapts effectively to new data and languages.
Universal speech models play a crucial role in facilitating natural and intuitive interactions between machines and humans, and can serve as a bridge between diverse languages and cultures. These models hold immense potential for various applications, such as virtual assistants, voice-activated devices, language translation, and speech-to-text transcription.
With this new update, the USM is now one of the most comprehensive speech recognition models in the world. This development is a significant step forward in Google’s efforts to create a more inclusive and accessible internet, as it will allow people who speak minority or lesser-known languages to engage with technology in a more meaningful way.