Matthews Correlation Coefficient (MCC)

# Matthews Correlation Coefficient (MCC)

Parent: [[Biological Sequence Modelling]]

A single-number metric for binary classification that behaves well under class imbalance. That is why it is so widely reported for genomics and protein tasks, where positive examples are often orders of magnitude rarer than negatives.

The formula takes all four entries of the confusion matrix (true positives, true negatives, false positives, false negatives) and combines them into the Pearson correlation between predicted and true binary labels, ranging from -1 to +1. A score of +1 means perfect prediction; 0 means no better than random; -1 means total disagreement with the truth, where every prediction is wrong.

Why it matters for biological sequence tasks: accuracy is misleading when 99% of the examples are negatives, because "always predict negative" gives 99% accuracy while being useless. F1 helps, but it focuses only on the positive class and ignores true negatives. MCC uses all four cells and is symmetric: swapping which class you call "positive" does not change the score.

When reading biological benchmark papers, MCC is the number to trust. A model with 95% accuracy and 0.2 MCC has probably learned little beyond the majority class. A model with 80% accuracy and 0.6 MCC is doing real work. On an imbalanced dataset, accuracy and MCC can rank models in opposite orders.

For compression assessment: MCC drops in biological tasks are often much larger than accuracy drops, because compression tends to damage the minority-class signal disproportionately. Always ask to see MCC alongside accuracy, not instead of it.

## Related

- [[Protein Language Models]]
- [[Sequence-to-Function Prediction]]

---

Tags: #bio #stats #evaluation #kp
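
## Worked example

A minimal sketch of the accuracy-versus-MCC contrast described in this note, using the standard definition MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The dataset (10 positives, 990 negatives), the two confusion matrices, and the helper names `mcc` and `accuracy` are all hypothetical and chosen only for illustration. Returning 0 when the denominator is zero is a common convention assumed here, not something the formula itself dictates.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator_sq = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denominator_sq == 0:
        # Undefined when a whole row or column of the matrix is empty
        # (e.g. the model never predicts positive). Assumed convention: 0.0.
        return 0.0
    return numerator / math.sqrt(denominator_sq)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all examples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced dataset: 10 positives, 990 negatives.
always_negative = dict(tp=0, tn=990, fp=0, fn=10)  # never predicts positive
useful_model = dict(tp=8, tn=950, fp=40, fn=2)     # finds most positives

for name, cm in [("always negative", always_negative),
                 ("useful model", useful_model)]:
    print(f"{name}: accuracy={accuracy(**cm):.3f}, MCC={mcc(**cm):.3f}")

# Symmetry: relabelling the classes swaps TP with TN and FP with FN,
# which leaves the score unchanged.
swapped = dict(tp=useful_model["tn"], tn=useful_model["tp"],
               fp=useful_model["fn"], fn=useful_model["fp"])
assert mcc(**swapped) == mcc(**useful_model)
```

In this toy setup the always-negative baseline has the higher accuracy (0.990 against 0.958) but an MCC of 0, while the second model reaches an MCC of roughly 0.35. If scikit-learn is available, `sklearn.metrics.matthews_corrcoef` computes the same quantity directly from label arrays.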