# Neural Scaling Laws & the Compression-Quality Tradeoff
Parent: [[Model Compression & Edge AI MOC]]
Why does a compressed 7B model sometimes match an uncompressed 1.5B? The naive answer — "the 7B was overparameterised" — is correct but incomplete. The precise answer sits in how loss scales with parameters, data, and compute. Chinchilla showed that most frontier models were undertrained for their size: the extra parameters added capacity that the training tokens never filled. That same insight is what makes compression possible without quality collapse: a well-designed compression scheme preserves the signal the model actually learned and discards the parameters that were along for the ride.
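The undertraining argument can be made concrete with the Chinchilla parametric loss fit, L(N, D) = E + A/N^α + B/D^β. A minimal sketch, treating the fitted constants from Hoffmann et al. (2022) as approximations (re-fits of these values vary), compares an undertrained 7B against a 1.5B trained on the same compute budget:

```python
# Chinchilla parametric loss fit (Hoffmann et al., 2022).
# Constants are approximate; different re-fits report slightly different values.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def flops(n_params: float, n_tokens: float) -> float:
    """Training compute under the standard C ~ 6ND approximation."""
    return 6 * n_params * n_tokens

# An undertrained 7B: only ~3 tokens per parameter.
big_under = loss(7e9, 20e9)

# The same compute spent on a 1.5B buys ~93B tokens (~62 tokens/param).
d_small = flops(7e9, 20e9) / (6 * 1.5e9)
small_trained = loss(1.5e9, d_small)

print(f"7B  @ 20B tokens: L = {big_under:.3f}")
print(f"1.5B @ {d_small / 1e9:.0f}B tokens: L = {small_trained:.3f}")
```

Under this fit, the compute-matched native 1.5B reaches lower loss than the undertrained 7B — the 7B's extra parameters are exactly the "along for the ride" capacity that compression can shed.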
The compression curve is not smooth. Capabilities drop out at different ratios, and the capabilities that drop first are often the most interesting.
## Key Concepts
- [[Kaplan Scaling Laws (2020)]] — the original "bigger is better" findings
- [[Chinchilla Scaling Laws (2022)]] — the compute-optimal correction (tokens-per-parameter)
- [[Lottery Ticket Hypothesis]] — sparse subnetworks that match the full network
- [[Effective Parameter Count]] vs. nominal parameter count
- [[Overparameterisation and Implicit Regularisation]]
- [[Emergent Capabilities and Their Fragility under Compression]]
- [[Distillation as Implicit Scaling Law Exploitation]]
## Key Questions
- Is this model undertrained (more parameters than tokens support) and therefore compressible?
- Which capabilities survive compression, and which collapse? (Reasoning and long-context tend to be fragile.)
- At what compression ratio does the quality curve inflect?
- Does compression preserve performance on long-tail and out-of-distribution tasks, or only on benchmark averages?
- Is the comparison fair — compressed large model vs. a properly trained native-small baseline?
- How much of the "compression win" comes from the compression method itself vs. from distillation on good data?
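The first and fifth questions above lend themselves to a back-of-the-envelope check. A sketch, assuming the ~20 tokens-per-parameter Chinchilla rule of thumb and the C ≈ 6ND compute approximation (both rough heuristics, not hard laws):

```python
# Rough triage: is the model undertrained (and so likely compressible),
# and what token budget makes the small-baseline comparison fair?
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # rule-of-thumb ratio, not a hard law

def undertrained(n_params: float, n_tokens: float) -> bool:
    """True if the model saw far fewer tokens than compute-optimal."""
    return n_tokens / n_params < CHINCHILLA_TOKENS_PER_PARAM

def matched_small_baseline(n_params_big: float, n_tokens_big: float,
                           n_params_small: float) -> float:
    """Token budget a native-small baseline needs for a compute-matched
    comparison: equate C ~ 6ND on both sides and solve for D_small."""
    return (6 * n_params_big * n_tokens_big) / (6 * n_params_small)

print(undertrained(7e9, 100e9))   # ~14.3 tokens/param -> True
print(matched_small_baseline(7e9, 100e9, 1.5e9) / 1e9)  # ~467B tokens
```

The second function is the point of the fairness question: a 1.5B baseline trained on a casual 100B tokens is not a fair opponent for a compressed 7B that absorbed 100B tokens' worth of 7B-scale compute.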
## Reading
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022)
- Frankle & Carbin, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks" (2018)
- Gadre et al., "Language Models Scale Reliably with Over-Training and on Downstream Tasks" (2024)
- Any recent benchmark that compares compressed large vs. native small on matched compute
---
Tags: #ai #scaling #theory #kp