Anthony Alford
Meta AI Research recently announced ESMFold, an AI model for predicting a protein's 3D structure from its amino acid sequence. ESMFold is built on a 15B-parameter Transformer model and achieves accuracy comparable to other state-of-the-art models with an order-of-magnitude speedup in inference time.
The model and several experiments were described in a paper published on bioRxiv. In contrast to other models such as AlphaFold2, which rely on external databases of sequence alignments, ESMFold uses a Transformer-based language model, ESM-2, an updated version of Meta's Evolutionary Scale Modeling (ESM) model, which learns the interactions between pairs of amino acids in a protein sequence. This allows ESMFold to predict protein structure 6x to 60x faster than AlphaFold2. Using ESMFold, the Meta team predicted the structures of one million protein sequences in less than a day. According to the researchers:
[R]apid and accurate structure prediction with ESMFold can help to play a role in structural and functional analysis of large collections of novel sequences. Obtaining millions of predicted structures within practical timescales can help reveal new insights into the breadth and diversity of natural proteins, and enable the discovery of new protein structures and functions.
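ESM-2, like the original ESM model, is trained as a masked language model: residues in an amino acid sequence are randomly hidden and the model learns to reconstruct them from the surrounding context, which is how it picks up dependencies between pairs of residues. The sketch below is purely illustrative, using a tiny PyTorch encoder with made-up dimensions; it is not Meta's actual model or API.

```python
# Illustrative only: a toy masked-language-model step over amino acid tokens,
# not Meta's ESM-2 implementation or API.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"           # 20 standard residues
stoi = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(AMINO_ACIDS)                     # extra token id for [MASK]
VOCAB = len(AMINO_ACIDS) + 1

class TinyProteinLM(nn.Module):
    """A miniature Transformer encoder standing in for a protein language model."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB)   # predicts the masked residue

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))       # per-residue embeddings
        return self.lm_head(h), h

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"          # toy amino acid sequence
tokens = torch.tensor([[stoi[aa] for aa in seq]])
masked = tokens.clone()
masked[0, 5] = MASK_ID                             # hide one residue

model = TinyProteinLM()
logits, embeddings = model(masked)
loss = nn.functional.cross_entropy(logits[0, 5:6], tokens[0, 5:6])
print(loss.item(), embeddings.shape)               # (1, seq_len, d_model)
```

The per-residue embeddings produced by the encoder are the representation that a downstream structure module can consume.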
Genetic codes in DNA are “recipes” for creating protein molecules from sequences of amino acids. Although these sequences are linear, the resulting proteins are folded into complex 3D structures which are key to their biological function. Traditional experimental methods for determining protein structure require expensive specialized equipment and may take years to complete. In late 2020, DeepMind’s AlphaFold2 solved the 50-year-old Protein Structure Prediction challenge of quickly and accurately predicting protein structure from the amino acid sequence.
Besides the raw amino acid sequence, the input to AlphaFold2 includes multiple sequence alignment (MSA) information, which aligns several related sequences on the assumption that they share a common evolutionary ancestor; searching these external databases creates a performance bottleneck. By contrast, ESMFold uses a learned language model representation that requires only the amino acid input sequence, which simplifies the model architecture and improves runtime performance. The language model representation is fed into a downstream component similar to AlphaFold2's, which predicts the 3D structure.
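Conceptually, the pipeline therefore has two stages: the language model produces per-residue embeddings from the raw sequence, and a folding module maps those embeddings to 3D coordinates. The toy head below (a hypothetical ToyFoldingHead, not ESMFold's actual folding trunk) only illustrates the shape of that interface; the real structure module is substantially more elaborate.

```python
# Illustrative only: how per-residue language-model embeddings might feed a
# structure head; ESMFold's actual folding trunk is far more complex.
import torch
import torch.nn as nn

class ToyFoldingHead(nn.Module):
    """Maps per-residue embeddings to 3D coordinates (hypothetical sketch)."""
    def __init__(self, d_model=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 3),        # one (x, y, z) point per residue
        )

    def forward(self, embeddings):        # (batch, seq_len, d_model)
        return self.mlp(embeddings)       # (batch, seq_len, 3)

embeddings = torch.randn(1, 33, 64)       # stand-in for language-model output
coords = ToyFoldingHead()(embeddings)
print(coords.shape)                       # torch.Size([1, 33, 3])
```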
The Meta team evaluated ESMFold on CAMEO and CASP14 test datasets and compared the results to both AlphaFold2 and another model, RoseTTAFold. ESMFold’s template modeling score (TM-score) was 83 on CAMEO and 68 on CASP14, compared to 88 and 84 for AlphaFold2 and 82 and 81 for RoseTTAFold. The researchers noted that there was a high correlation between ESMFold’s TM-score and the underlying language model’s perplexity, which implies that “improving the language model is key to improving single-sequence structure prediction accuracy.”
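For reference, TM-score measures structural similarity between a predicted and a reference structure on a 0-to-1 scale (the figures above are multiplied by 100), with higher values indicating closer agreement. The sketch below computes the standard formula for two structures that are already superposed and residue-aligned; the full metric also optimizes the superposition, which is omitted here.

```python
# Simplified TM-score: assumes the two structures are already superposed and
# residue-aligned; the full metric also searches over superpositions.
import numpy as np

def tm_score(pred_coords, true_coords):
    """TM-score for two aligned sets of C-alpha coordinates, each shape (L, 3)."""
    L = len(true_coords)
    d0 = 1.24 * (L - 15) ** (1.0 / 3.0) - 1.8               # length-dependent scale
    d = np.linalg.norm(pred_coords - true_coords, axis=1)   # per-residue distances
    return np.mean(1.0 / (1.0 + (d / d0) ** 2))

# Sanity check: identical structures score exactly 1.0
coords = np.random.rand(100, 3) * 10
print(tm_score(coords, coords))   # 1.0
```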
Meta and other organizations have been researching the use of language models in genomics for several years. In 2020, InfoQ covered Google’s BigBird language model which outperformed baseline models on two genomics classification tasks. That same year, InfoQ also covered Meta’s original open-source ESM language model for computing an embedding representation for protein sequences. In 2021, InfoQ covered DeepMind’s AlphaFold2, and DeepMind recently announced the release of AlphaFold2’s predictions of structures “for nearly all catalogued proteins known to science.”
Several members of the Meta team participated in a Twitter thread answering questions about the work. In response to a question about the model's maximum input sequence length, researcher Zeming Lin replied:
Currently able to do up to 3k length proteins, though at some point it becomes computationally bound.
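Lin's note that long proteins become "computationally bound" is consistent with how standard Transformer self-attention scales: memory and compute for each attention map grow with the square of the sequence length. The snippet below is only a back-of-the-envelope illustration of that growth, not a measurement of ESMFold itself.

```python
# Rough illustration (not Meta's numbers): memory for one float32 self-attention
# map grows quadratically with sequence length.
for length in (500, 1000, 3000, 10000):
    attn_bytes = length * length * 4   # one L x L float32 matrix
    print(f"L={length:>6}: {attn_bytes / 2**20:8.1f} MiB per attention map")
```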
Meta has not yet open-sourced ESMFold, although Lin said the model will be open-sourced in the future.