# Model Compression Fundamentals
Parent: [[Model Compression & Edge AI MOC]]
Four techniques do most of the work: quantization, pruning, knowledge distillation, and low-rank decomposition. Every compressed model in production combines some subset of these. Before evaluating any "novel" approach, understand what normal looks like: most claimed breakthroughs are recombinations of the four, sometimes dressed up with new names.
## Key Concepts
- [[Quantization]] — reducing the bit-width of weights and activations (FP16 → INT8 → INT4 and below)
- [[Pruning]] — removing weights, neurons, or attention heads entirely (structured vs unstructured)
- [[Distillation]] — training a smaller "student" to imitate a larger "teacher"
- [[Low-Rank Decomposition]] — approximating weight matrices as products of smaller matrices
- [[Post-Training vs Quantisation-Aware Training]]
- [[Calibration Datasets]] — why compression itself needs data
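To make the first concept concrete, here is a minimal sketch of symmetric per-tensor post-training quantization to INT8 in plain NumPy. It is illustrative only, not tied to any framework's API; real toolchains add per-channel scales, zero points, and calibration over activation statistics.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor scheme: map [-max|w|, +max|w|] onto [-127, 127].
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an FP32 approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Round-to-nearest bounds the per-weight error by half a quantization step.
max_err = float(np.max(np.abs(w - w_hat)))
print(f"scale={scale:.5f}, max reconstruction error={max_err:.5f}")
```

The storage win is the point: INT8 weights take 4x less memory than FP32 (2x less than FP16), at the cost of the bounded rounding error measured above.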
## Key Questions
- Which technique is being applied, and in what order?
- What is the compression ratio (params, memory, FLOPs) and on which benchmark?
- What accuracy floor is acceptable for the target task?
- Is the technique training-free (post-hoc) or does it require retraining?
- Does it compose cleanly with other techniques, or do gains stop stacking?
- What is the inference-time cost on the target hardware? (Theoretical compression ≠ realised speedup.)
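The compression-ratio question above can be made concrete with low-rank decomposition, where the arithmetic is exact: factoring an m×n matrix W into A (m×r) times B (r×n) replaces m·n parameters with r·(m+n). The sketch below uses truncated SVD on a toy layer; the layer size and rank are illustrative, not from any specific model.

```python
import numpy as np

m, n, r = 1024, 1024, 64  # illustrative layer shape and retained rank
rng = np.random.default_rng(0)
w = rng.normal(size=(m, n)).astype(np.float32)

# Truncated SVD: keep the top-r singular triplets and fold the
# singular values into the left factor.
u, s, vt = np.linalg.svd(w, full_matrices=False)
a = u[:, :r] * s[:r]   # shape (m, r)
b = vt[:r, :]          # shape (r, n)

orig_params = m * n              # 1_048_576
lowrank_params = r * (m + n)     # 131_072
ratio = orig_params / lowrank_params
print(f"params: {orig_params} -> {lowrank_params} ({ratio:.1f}x fewer)")
```

Note the caveat baked into the Key Questions: the 8x parameter reduction here says nothing about accuracy (a random Gaussian matrix has no good low-rank approximation) or about realised speedup, which depends on whether the target hardware executes two thin matmuls faster than one dense one.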
## Reading
- Han et al., "Deep Compression" (2015) — the canonical stacked-technique paper
- Hinton et al., "Distilling the Knowledge in a Neural Network" (2015)
- Hugging Face Optimum documentation — practical reference for current tooling
- NVIDIA TensorRT and Intel Neural Compressor docs for production-side reality
---
Tags: #ai #compression #kp