# Kaplan Scaling Laws (2020)

Parent: [[Neural Scaling Laws & the Compression-Quality Tradeoff]]

The OpenAI paper that made scaling an explicit, measurable phenomenon. Kaplan and colleagues showed that cross-entropy loss on language modelling decreases as a smooth power law in three variables: parameters, training tokens, and compute. When any one is the bottleneck, loss follows a clean line on a log-log plot (the fitted forms are sketched at the end of this note).

The most consequential finding was the recipe: for a fixed compute budget, the optimal strategy was to make the model bigger, not to train a smaller model for longer. Scale up parameters aggressively, scale up tokens modestly (a toy version of this allocation is also sketched below). This directly shaped the next two years of frontier model design — GPT-3 and its imitators were built on the Kaplan recipe.

The prediction worked remarkably well. Losses landed where the curves said they would. Capabilities that were invisible at small scale emerged predictably at large scale. This was when "scale is the answer" stopped being a vibe and started being a thesis with error bars.

The paper was also, in retrospect, partly wrong — specifically on the compute-optimal balance between parameters and tokens. Chinchilla corrected this two years later. But the existence and shape of the scaling laws themselves have held up. The underlying empirical fact — that loss is a predictable function of scale across many orders of magnitude — is one of the most important findings in modern machine learning.

## Related

- [[Chinchilla Scaling Laws (2022)]]
- [[Emergent Capabilities and Their Fragility under Compression]]

---
Tags: #ai #scaling #theory #kp
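
## Sketches

The fitted forms, as I recall them from the paper (the exponents are the paper's approximate headline values; $N_c$, $D_c$, $C_c$ are fit coefficients, not derived here). Each law holds when the other two quantities are not the bottleneck:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
$$

$$
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095
$$

$$
L(C_{\min}) \approx \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C^{\min}}, \qquad \alpha_C^{\min} \approx 0.050
$$

On log-log axes each of these is a straight line, which is what "a clean line on a log-log plot" means above: the slope is just the (negative) exponent.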
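And a minimal sketch of what the recipe means in practice, assuming the standard $C \approx 6ND$ FLOPs approximation. The exponents are the headline ones (Kaplan: $N \propto C^{0.73}$; Chinchilla: roughly $N \propto C^{0.5}$); the anchor point is hypothetical, chosen only so the two rules start from the same place:

```python
def split_budget(C, anchor_C, anchor_N, exponent):
    """Split a compute budget C (FLOPs) into (params N, tokens D).

    N scales as C**exponent through the anchor point; D is whatever
    the C ~ 6*N*D approximation leaves over.
    """
    N = anchor_N * (C / anchor_C) ** exponent
    D = C / (6 * N)
    return N, D

# Hypothetical anchor: both rules agree at 1e21 FLOPs with a 1B-param model.
ANCHOR_C, ANCHOR_N = 1e21, 1e9

for C in (1e21, 1e23, 1e25):
    kap_N, kap_D = split_budget(C, ANCHOR_C, ANCHOR_N, 0.73)  # Kaplan recipe
    chi_N, chi_D = split_budget(C, ANCHOR_C, ANCHOR_N, 0.50)  # Chinchilla recipe
    print(f"C={C:.0e}  Kaplan: N={kap_N:.1e}, D={kap_D:.1e}  "
          f"Chinchilla: N={chi_N:.1e}, D={chi_D:.1e}")
```

The divergence is the Kaplan-vs-Chinchilla dispute in miniature: grow the budget 10,000x and the Kaplan split buys roughly 8x more parameters than Chinchilla's, while Chinchilla's buys roughly 8x more tokens.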