OpenAI Releases 1.6 Billion Parameter Multilingual Speech Recognition AI Whisper

By Anthony Alford

Article originally posted on InfoQ.

OpenAI recently released Whisper, a 1.6 billion parameter AI model that can transcribe and translate speech audio from 97 different languages. Whisper was trained on 680,000 hours of audio data collected from the web and shows robust zero-shot performance on a wide range of automatic speech recognition (ASR) tasks.

Whisper uses an encoder-decoder Transformer architecture and processes audio in 30-second chunks. Unlike most state-of-the-art ASR models, Whisper is not fine-tuned on any benchmark dataset; instead, it is trained using “weak” supervision on a large-scale, noisy dataset of speech audio and paired transcription text collected from the internet. In zero-shot evaluations on a set of speech recognition datasets, Whisper made on average 55% fewer errors than a baseline wav2vec 2.0 model. According to OpenAI:

We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing…We hope Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications.
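The released inference code exposes this pipeline through a small Python API. The sketch below follows the usage shown in the project README; the file name audio.mp3 is a placeholder, and ffmpeg must be installed for audio decoding.

    import whisper

    # "base" is one of the smaller released checkpoints; the largest
    # checkpoint ("large") corresponds to the ~1.6-billion-parameter
    # model discussed in this article.
    model = whisper.load_model("base")

    # transcribe() handles the 30-second chunking internally, computing
    # a log-Mel spectrogram for each window and decoding text with the
    # Transformer decoder.
    result = model.transcribe("audio.mp3")
    print(result["text"])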

Training a deep-learning speech-recognition model using only supervised learning would require a large dataset containing audio data with corresponding accurate, gold-standard transcripts. Acquiring such a dataset can be challenging, so researchers usually turn to transfer learning: fine-tuning models that have been pretrained on a large, publicly available dataset of audio only. For example, Meta’s XLS-R is pretrained on 436K hours of speech audio, then fine-tuned on much smaller benchmark-specific training sets.
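For context, a conventional transfer-learning setup looks roughly like the following sketch. It assumes the Hugging Face Transformers library and the publicly available facebook/wav2vec2-xls-r-300m checkpoint; the fine-tuning loop itself is omitted.

    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

    # XLS-R was pretrained on unlabeled audio only; loading it with a
    # CTC head leaves that head randomly initialized until it is
    # fine-tuned on labeled (audio, transcript) pairs.
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
        "facebook/wav2vec2-xls-r-300m")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xls-r-300m")

    # The fine-tuning loop (e.g. with the Trainer API) is omitted; the
    # key point is that labeled, benchmark-specific data is still
    # required at this second stage.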

However, the OpenAI researchers contend that this process has downsides; in particular, the fine-tuned models often do not generalize well, “which limits their usefulness and robustness.” Instead of pretraining Whisper on audio-only data, the team collected a large dataset of audio with “noisy” transcripts: 680,000 hours of audio with “a large amount of subpar transcripts” scraped from the Internet; 117,000 hours of this data was speech in 96 non-English languages. By trading quality for quantity, the resulting model achieves good zero-shot performance on a wide range of tasks, including speech transcription in several languages as well as translation and language identification.
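This multitask behavior is exposed directly in the released package. The sketch below is adapted from the patterns in the project README: it runs language identification on a clip and then translates the same clip to English. The file name speech.mp3 is a placeholder.

    import whisper

    model = whisper.load_model("base")

    # Language identification: pad or trim the audio to a 30-second
    # window, compute the log-Mel spectrogram, and let the model score
    # candidate languages.
    audio = whisper.pad_or_trim(whisper.load_audio("speech.mp3"))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # Translation: the same model can emit English text for
    # non-English speech.
    result = model.transcribe("speech.mp3", task="translate")
    print(result["text"])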

To evaluate Whisper, the team compared its performance on a set of benchmark tasks to several baseline models, focusing on zero-shot scenarios. Although Whisper had “unremarkable” performance on the LibriSpeech benchmark for English transcription, on other datasets it outperformed baseline models that were fine-tuned on LibriSpeech “by large amounts.” On the CoVoST2 translation benchmark, Whisper set a new zero-shot state of the art without using any of the benchmark’s training data. On long-form transcription, Whisper outperformed NVIDIA STT, the best open-source model, and was “very close to human-level accuracy.”
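Comparisons like these are typically reported as word error rate (WER): the fraction of words substituted, inserted, or deleted relative to a reference transcript. The snippet below illustrates the metric with the third-party jiwer package, which is an assumption; the article does not specify the evaluation tooling.

    from jiwer import wer

    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over a lazy dog"

    # WER = (substitutions + insertions + deletions) / words in reference
    print(f"WER: {wer(reference, hypothesis):.2f}")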

In a Hacker News discussion about Whisper, several users wryly commented on OpenAI “finally living up to their name” by open-sourcing Whisper, referring to OpenAI’s reluctance to do the same with some of their other models. Another user commented on Whisper’s performance:

The 4 examples are stunningly good (the examples have speakers with heavy accents, speaking in foreign language, speaking with dynamic background noise, etc.), this is far and away better than anything else I’ve seen. Will be super curious to see other folks trying it out and seeing if it’s as robust as it seems, including when confronted with audio speech with natural tics and uhhh’s and uhmm’s and everything in-between.

The Whisper source code and pre-trained model files are available on GitHub.
 
