# Model Compression Fundamentals

Parent: [[Model Compression & Edge AI MOC]]

Four techniques do most of the work: quantisation, pruning, knowledge distillation, and low-rank decomposition. Every compressed model in production is some combination of these. Before evaluating any "novel" approach, understand what normal looks like — most claimed breakthroughs are recombinations of the four, sometimes dressed up with new names.

## Key Concepts

- [[Quantization]] — reducing the bit-width of weights and activations (FP16 → INT8 → INT4 and below)
- [[Pruning]] — removing weights, neurons, or attention heads entirely (structured vs unstructured)
- [[Distillation]] — training a smaller "student" to imitate a larger "teacher"
- [[Low-Rank Decomposition]] — approximating weight matrices as products of smaller matrices
- [[Post-Training vs Quantisation-Aware Training]] — whether compression happens after training or is folded into it
- [[Calibration Datasets]] — why compression itself needs data

## Key Questions

- Which technique is being applied, and in what order?
- What is the compression ratio (params, memory, FLOPs), and on which benchmark?
- What accuracy floor is acceptable for the target task?
- Is the technique training-free (post-hoc), or does it require retraining?
- Does it compose cleanly with other techniques, or do gains stop stacking?
- What is the inference-time cost on the target hardware? (Theoretical compression ≠ realised speedup.)

## Reading

- Han et al., "Deep Compression" (2015) — the canonical stacked-technique paper
- Hinton et al., "Distilling the Knowledge in a Neural Network" (2015)
- Hugging Face Optimum documentation — practical reference for current tooling
- NVIDIA TensorRT and Intel Neural Compressor docs for production-side reality

---
Tags: #ai #compression #kp
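To make the first technique concrete, here is a minimal sketch of post-training symmetric INT8 quantisation of a weight tensor. This is the simplest max-abs scaling variant, in plain NumPy — an illustration of the idea, not the method of any particular library (real tools use per-channel scales, calibration data, and fused kernels):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric max-abs quantisation: map floats onto the int8 range [-127, 127]."""
    scale = np.abs(w).max() / 127.0      # one scale for the whole tensor (per-tensor variant)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)               # 0.25 — int8 stores 4x less than float32
print(float(np.abs(w - w_hat).max()))    # worst-case error is bounded by scale / 2
```

The 4x memory reduction is exact; the accuracy cost is the rounding error, bounded by half the scale per weight — which is why outlier weights (which inflate the max-abs scale) are the usual failure mode of this naive scheme.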
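The fourth technique can be sketched just as briefly: low-rank decomposition by truncated SVD, replacing one weight matrix with the product of two thin factors. The rank here is an arbitrary illustrative choice; in practice it is chosen against an error or latency budget, and real weight matrices (unlike this random one) have decaying spectra that make the approximation much tighter:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(512, 512)).astype(np.float32)

# Truncated SVD: W ≈ A @ B with A (512 x r) and B (r x 512).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 64                                   # illustrative rank, not a recommendation
A = U[:, :r] * S[:r]                     # absorb singular values into the left factor
B = Vt[:r, :]

params_before = W.size
params_after = A.size + B.size
print(params_after / params_before)      # 0.25 — 4x fewer parameters at rank 64
```

The matmul cost drops by the same factor, since `x @ W` becomes `(x @ A) @ B`. By the Eckart–Young theorem this truncation is the best possible rank-r approximation in Frobenius norm, so the discarded singular values tell you exactly what accuracy you paid.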