# Biological Sequence Modelling

Parent: [[Model Compression & Edge AI MOC]]

Transformers learned to read protein and DNA because biological sequences are, at the relevant abstraction, just another token stream. The interesting part is what this unlocks: sequence-to-function prediction, variant effect analysis, protein design, and, if compression holds, deployment at the point of sample rather than in the cloud.

The domain has its own benchmarks, its own tokenisation choices, and its own evaluation metrics. Do not assume NLP intuitions transfer cleanly. A "small" model in genomics is often a surprisingly capable one, because the underlying information density per token is different.

## Key Concepts

- [[Genomic Tokenisation]]: nucleotide, k-mer, and learned BPE schemes
- [[Protein Language Models]]: ESM family and descendants
- [[Nucleotide Transformer Benchmark Suite]]
- [[Matthews Correlation Coefficient (MCC)]]: the default metric for imbalanced classification in genomics
- [[Sequence-to-Function Prediction]]
- [[Variant Effect Prediction]]
- [[Transfer Learning from Large Biological Corpora]]

## Key Questions

- What is the tokenisation scheme, and how does it interact with compression? (Some tokenisers are more compressible than others.)
- Which benchmark suite is being reported, and is the train/test split leak-free?
- Is MCC (or the equivalent metric) the right measure for this task class, or is it hiding class imbalance?
- Does a small-parameter model actually beat a large one on the same benchmark, or is the comparison apples-to-oranges (different pretraining, different fine-tuning data)?
- How domain-specific is the pretraining corpus? A model trained on human DNA may not transfer to microbial.
- Is there a realistic deployment story (point-of-sample sequencing hardware), or is this still cloud-only?

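On the MCC question, a small worked sketch helps show why the metric is preferred over accuracy when classes are imbalanced. The labels and predictions below are hypothetical toy data (5 positives in 100 samples), not results from any benchmark, and the helper names are illustrative only. MCC is computed from the confusion matrix as `(TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))`, with the common convention of returning 0 when the denominator is zero.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy task: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Classifier A always predicts the majority (negative) class.
pred_a = [0] * 100

# Classifier B recovers 4 of 5 positives at the cost of 6 false positives.
pred_b = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

counts_a = confusion_counts(y_true, pred_a)
counts_b = confusion_counts(y_true, pred_b)

acc_a, mcc_a = accuracy(*counts_a), mcc(*counts_a)
acc_b, mcc_b = accuracy(*counts_b), mcc(*counts_b)

print(f"A: accuracy={acc_a:.2f}  MCC={mcc_a:.2f}")
print(f"B: accuracy={acc_b:.2f}  MCC={mcc_b:.2f}")
```

In this toy case the majority-class predictor wins on accuracy but scores zero MCC, while the classifier that actually finds positives has lower accuracy and clearly higher MCC. That is the pattern to look for when a paper reports only one of the two.
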
## Reading

- Rives et al., ESM protein language model family
- Dalla-Torre et al., "The Nucleotide Transformer" (2023)
- Lin et al., ESM-2 / ESMFold (2022–2023)
- Benegas et al., "DNA language models are powerful predictors of genome-wide variant effects" (GPN, 2023)
- Any recent review on genomic foundation models

---

Tags: #bio #ai #genomics #kp