# Emergent Capabilities and Their Fragility under Compression
Parent: [[Neural Scaling Laws & the Compression-Quality Tradeoff]]
Some capabilities in large language models appear discontinuously: below a certain scale the model cannot do a task at all; above it, it can. Chain-of-thought reasoning, multi-step arithmetic, and code generation from natural language all showed up as step functions, not smooth ramps. Whether these emergences are genuine phase transitions or artefacts of how we measure (a discontinuous metric like exact match can turn smooth underlying improvement into an apparent jump) is a live debate. Either way, the empirical fact that benchmark curves look discontinuous has shaped how the field thinks about model size.
The fragility under compression is the part that matters for deployment. When you compress a model by 4x, benchmark averages move gently. The long tail moves less gently. Reasoning tasks, multi-hop retrieval, and low-resource language performance drop faster than the averaged loss suggests. Emergent capabilities appear to be the first to break, and they break in ways that are hard to see until a user trips over them.
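A toy illustration of that masking effect (all task names and numbers below are hypothetical, chosen only to make the arithmetic visible): a mild uniform drop across most tasks plus a collapse on two long-tail tasks produces a small movement in the average.

```python
# Hypothetical per-task accuracies before and after 4x compression.
# 18 "ordinary" tasks degrade mildly; two long-tail tasks collapse.
baseline = {f"task_{i}": 0.80 for i in range(18)}
baseline.update({"multi_hop_qa": 0.70, "gsm_style_math": 0.65})

compressed = {k: v - 0.02 for k, v in baseline.items()}  # mild uniform drop
compressed["multi_hop_qa"] = 0.35   # emergent capability breaks
compressed["gsm_style_math"] = 0.20

avg_drop = sum(baseline.values()) / 20 - sum(compressed.values()) / 20
worst_drop = max(baseline[k] - compressed[k] for k in baseline)

print(f"average drop:    {avg_drop:.3f}")   # under 6 points
print(f"worst task drop: {worst_drop:.3f}") # 45 points
```

The averaged score moves by a few points while one task loses two thirds of its accuracy, which is exactly the failure mode a benchmark-average dashboard will not surface.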
The explanation is probably something like this. Emergence corresponds to the point at which a particular circuit inside the model becomes reliable enough to be used. That circuit is encoded in a specific subset of parameters. Compression methods that damage parameters uniformly will, on average, hit the critical circuit as hard as everything else; but the circuit needs all of its components to function, so for that capability the damage is all-or-nothing.
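The all-or-nothing argument can be sketched as a probability calculation (a toy model assuming random, independent pruning, which real magnitude-based pruning is not): if pruning keeps each parameter with probability `s`, a circuit that needs all `k` of its parameters intact survives with probability `s**k`, which falls off sharply in `k` even though per-parameter damage is the same everywhere.

```python
def circuit_survival(s: float, k: int) -> float:
    """Probability that a circuit of k parameters survives pruning
    that keeps each parameter independently with probability s.
    Toy model: real pruning is neither random nor independent."""
    return s ** k

s = 0.75  # keep 75% of parameters; per-parameter damage is 25% everywhere
for k in (1, 5, 20):
    print(f"k={k:2d}  survival={circuit_survival(s, k):.4f}")
# a single parameter survives 75% of the time, but a capability that
# needs a 20-parameter circuit intact survives only ~0.3% of the time
```

Under these assumptions the averaged loss, which reflects the 25% per-parameter damage, degrades smoothly, while any capability resting on a wide-enough circuit flips from present to absent.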
The practical consequence: benchmark-averaged accuracy is a lagging indicator of compression damage. If you care about reasoning or long-tail performance, measure those directly.
## Related
- [[Kaplan Scaling Laws (2020)]]
- [[Chinchilla Scaling Laws (2022)]]
- [[Lottery Ticket Hypothesis]]
---
Tags: #ai #theory #compression #kp