# Protein Language Models

Parent: [[Biological Sequence Modelling]]

Transformers trained on large corpora of protein sequences, treating each amino acid as a token. The family that matters most is ESM (Evolutionary Scale Modeling), from Meta's FAIR team, with ESM-1b, ESM-2, and ESMFold as the canonical progression. These models learn representations that encode structural, functional, and evolutionary information — emergent from the pretraining objective of masked amino-acid prediction alone.

The surprising empirical finding is how much structural biology is implicit in sequence statistics. A protein language model trained only to fill in masked amino acids produces internal representations from which you can read out secondary structure, contact maps, and even three-dimensional folds. ESMFold uses this to predict structures directly from sequence at quality approaching AlphaFold2, and much faster, because it does not require multiple sequence alignments at inference time.

For downstream tasks, protein language models are now the default starting point. Variant effect prediction, protein-protein interaction, stability prediction, and functional annotation are all typically done by fine-tuning a pretrained PLM or extracting its embeddings. The representations are that good.

The deployment question for edge or point-of-sample biology is whether compressed PLMs retain enough of the signal to be useful. Smaller PLMs (ESM-2 at 8M, 35M, and 150M parameters) exist precisely for this reason, and they often perform respectably — not at the level of the 15B variant, but close enough for narrow tasks. This is an active area where the Chinchilla logic genuinely applies: for many downstream tasks, more tokens (more protein sequences) in pretraining matter more than more parameters.

## Related

- [[Sequence-to-Function Prediction]]
- [[Variant Effect Prediction]]
- [[Matthews Correlation Coefficient (MCC)]]

---
Tags: #bio #ai #proteins #kp
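The masked-prediction objective gives a zero-shot recipe for variant effect prediction: mask the mutated position and score a variant as the log-probability ratio the model assigns to the mutant versus the wild-type residue. A minimal sketch of that scoring step — `masked_logits` here is a hypothetical stand-in for a real PLM forward pass (ESM-2 via `fair-esm` or HuggingFace `transformers` would supply actual logits):

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def masked_logits(sequence: str, pos: int) -> dict[str, float]:
    """Hypothetical stand-in for a PLM forward pass with position `pos` masked.
    A real model (e.g. ESM-2) returns logits over the amino-acid vocabulary;
    this toy version just mildly prefers the wild-type residue."""
    wt = sequence[pos]
    return {aa: (2.0 if aa == wt else 0.0) for aa in AMINO_ACIDS}

def log_softmax(logits: dict[str, float]) -> dict[str, float]:
    # Numerically stable log-softmax over the amino-acid vocabulary.
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {aa: v - log_z for aa, v in logits.items()}

def masked_marginal_score(sequence: str, pos: int, mut: str) -> float:
    """Masked-marginal variant score:
    log p(mutant | context) - log p(wild-type | context),
    both read from the same masked prediction.
    Negative values mean the model disfavors the mutation."""
    logp = log_softmax(masked_logits(sequence, pos))
    return logp[mut] - logp[sequence[pos]]

# Score the substitution A→W at 0-indexed position 3 of a toy sequence.
score = masked_marginal_score("MKTAYIAKQR", pos=3, mut="W")
```

Swapping in a real model only changes `masked_logits`; the scoring arithmetic is the same, which is why compressed PLMs can be evaluated on this task with no code changes.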