# Chinchilla Scaling Laws (2022)

Parent: [[Neural Scaling Laws & the Compression-Quality Tradeoff]]

DeepMind's correction to Kaplan. By training hundreds of models across a grid of parameter and token counts, Hoffmann et al. showed that the compute-optimal ratio was not what Kaplan had reported: for a fixed compute budget, parameters and training tokens should scale together, roughly one-to-one, not with parameters growing much faster than tokens. The concrete recipe: for every doubling of parameters, double the training tokens. The rule of thumb that emerged is roughly 20 training tokens per model parameter for compute-optimal training.

The paper demonstrated this by training Chinchilla (70B parameters, 1.4T tokens) and showing it outperformed Gopher (280B parameters, 300B tokens) despite using the same compute budget. The smaller, properly trained model won across nearly every benchmark. This was the empirical blow that forced the field to reassess.

The implication for compression is huge. Most pre-Chinchilla frontier models were undertrained: they had more parameters than their training data could fill with signal. That is what made them compressible without quality loss: the extra parameters were structurally present but informationally empty. Post-Chinchilla training recipes pack each parameter with more signal, which is why compression ratios on modern well-trained models are often lower than on older ones.

Chinchilla scaling also came with a caveat that the field has since explored. For inference-heavy deployments, training past the compute-optimal point (using more tokens than the law prescribes) is often worth it, because you pay the training cost once but the inference cost forever.

## Related

- [[Kaplan Scaling Laws (2020)]]
- [[Emergent Capabilities and Their Fragility under Compression]]

---
Tags: #ai #scaling #theory #kp
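The 20-tokens-per-parameter recipe can be turned into a quick calculator. This is a minimal sketch, assuming the standard training-FLOPs approximation C ≈ 6·N·D (an assumption not stated in this note, but commonly used alongside Chinchilla): substituting D = 20·N gives C = 120·N², which you can solve for N directly.

```python
import math

def chinchilla_optimal(flops_budget: float) -> tuple[float, float]:
    """Compute-optimal parameter count N and token count D for a FLOPs budget C.

    Assumes C ~= 6 * N * D (standard training-cost approximation) and the
    ~20 tokens-per-parameter rule of thumb. Then C = 6 * N * (20 * N) = 120 * N^2.
    """
    n_params = math.sqrt(flops_budget / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's approximate budget: 6 * 70e9 * 1.4e12 = 5.88e23 FLOPs.
n, d = chinchilla_optimal(5.88e23)
print(f"params = {n/1e9:.0f}B, tokens = {d/1e12:.1f}T")  # params = 70B, tokens = 1.4T
```

Plugging in Chinchilla's own budget recovers the 70B-parameter, 1.4T-token configuration from the paper, which is a good sanity check on the two rules of thumb.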