CoronaVirus News

From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences

Protein language models (PLMs) capture features of protein three-dimensional structure from amino acid sequences alone, without requiring multiple sequence alignments (MSAs). The concepts of grammar and semantics from natural language have been shown to be applicable to understanding functional properties of viral proteins, such as antigenicity. Here, we investigate how these representations more broadly enable assessment of both observed and previously unobserved variation due to mutation. Applying the PLM ESM-2 to the SARS-CoV-2 spike protein, we demonstrate that it has learned the sequence context within which variation occurs, capturing evolutionary constraint. We demonstrate that this recapitulates what conventionally requires MSA data, predicting where variation does and does not accumulate across the protein. We show that the PLM-derived measures of amino acid change represent novel metrics, as they do not correlate strongly with classical metrics of sequence change. Applying ESM-2 to SARS-CoV-2 variants from across the pandemic, we demonstrate that its representations encode the relatedness between variants, i.e., their evolutionary history, as well as the distinct nature of variants of concern upon their emergence, associated with shifts in receptor binding and antigenicity. This application of ESM-2 can be used to characterise the evolutionary potential and possible epidemiological impact of both existing and emerging variation, and could in principle be applied to any protein sequence, pathogen or otherwise.
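
To make the idea of PLM-derived measures of amino acid change more concrete, the following is a minimal sketch and not the method used in the study. It assumes that a masked protein language model has already produced, for each sequence position, a probability for each of the 20 amino acids, and that each sequence has an embedding vector. The function names, the toy probabilities and the toy embeddings are all hypothetical. Positional entropy is used here as one possible proxy for constraint, a log-odds ratio as a "grammar"-like score for a substitution, and an L1 embedding distance as a "semantic"-like score.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_entropy(probs):
    """Shannon entropy (bits) of one position's amino-acid distribution.

    Low entropy suggests a constrained position; high entropy a tolerant one.
    """
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def mutation_log_odds(probs, wild_type, mutant):
    """log2 ratio of the mutant to the wild-type probability at one position.

    Negative values mean the model finds the mutant less plausible in context.
    """
    return math.log2(probs[mutant] / probs[wild_type])


def l1_distance(embedding_a, embedding_b):
    """L1 distance between two equal-length embedding vectors."""
    return sum(abs(a - b) for a, b in zip(embedding_a, embedding_b))


# Hypothetical model outputs (NOT real ESM-2 values), for illustration only.
tolerant = {aa: 1 / 20 for aa in AMINO_ACIDS}      # every residue equally likely
constrained = {aa: 0.001 for aa in AMINO_ACIDS}    # almost all mass on one residue
constrained["N"] = 1 - 0.001 * 19

reference_embedding = [0.10, -0.20, 0.30]          # toy 3-dimensional embeddings
variant_embedding = [0.15, -0.10, 0.30]

print(position_entropy(tolerant) > position_entropy(constrained))
print(mutation_log_odds(constrained, "N", "Y") < 0)
print(l1_distance(reference_embedding, variant_embedding))
```

In a real analysis the probabilities and embeddings would come from the language model itself; the sketch only shows how such outputs could be turned into per-position and per-variant scores.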