# KV Cache Compression and Eviction

Parent: [[Efficient Transformer Architectures for Edge]]

During autoregressive generation, every new token must attend to every prior token. To avoid recomputing those prior tokens' keys and values from scratch on each step, they are cached in GPU memory: the KV cache. For long contexts and large models, the cache dominates memory. A 70B model at 128K context can consume tens of gigabytes in KV cache per request, and across a batch of requests the total cache can exceed the model weights themselves.

Compression and eviction strategies try to shrink it. Compression quantises the cache to low precision (4-bit, sometimes lower) and accepts a small accuracy cost. Eviction drops tokens from the cache entirely: either the oldest (sliding window), the least attended to (attention-score-based eviction), or a combination. StreamingLLM, H2O, and similar methods are well-known examples.

The fundamental tension is that you do not know in advance which tokens will matter. A token that seems unimportant at step t might be exactly what some future generation step needs. Eviction strategies trade tail-case quality for typical-case memory savings; compression strategies accept a smaller savings ceiling in exchange for cleaner, more graceful quality degradation.

Beyond raw size, the cache is also a bandwidth problem. Every generated token requires reading the entire cache to compute attention, so memory bandwidth per token grows linearly with context length. This is why per-token decode latency degrades at long context, and why the architectural moves that shrink the cache (GQA, sliding window attention, state-space models) are where the serious engineering attention has gone.

## Related

- [[Grouped Query Attention (GQA)]]
- [[State-Space Models (Mamba)]]
- [[FlashAttention and Memory-Bound Attention]]

---

Tags: #ai #transformers #edge #kp
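A back-of-envelope check of the cache-size claim above, assuming Llama-2-70B-like dimensions (80 layers, 8 KV heads after GQA, head dimension 128, fp16 cache). The constants are illustrative assumptions, not figures from this note:

```python
# Rough KV cache sizing for a 70B-class model (assumed dimensions).
LAYERS = 80        # transformer layers
KV_HEADS = 8       # KV heads (GQA already reduced these from 64 query heads)
HEAD_DIM = 128     # dimension per head
BYTES_FP16 = 2.0   # bytes per element at fp16

def kv_cache_bytes(context_len: int, bytes_per_elem: float = BYTES_FP16) -> float:
    # 2x for keys and values, stored per layer, per KV head, per head dim
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem
    return per_token * context_len

ctx = 128 * 1024
fp16_gib = kv_cache_bytes(ctx) / 2**30
int4_gib = kv_cache_bytes(ctx, bytes_per_elem=0.5) / 2**30  # 4-bit quantised
print(f"fp16: {fp16_gib:.0f} GiB, 4-bit: {int4_gib:.0f} GiB")  # → fp16: 40 GiB, 4-bit: 10 GiB
```

Under these assumptions a single 128K-context request already holds roughly 40 GiB of fp16 cache, which is why 4-bit cache quantisation (about 4x smaller) and GQA (already an 8x reduction here versus per-query-head caching) compound so effectively.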
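The eviction policies above can be sketched at the level of cache positions. This is a toy StreamingLLM-style policy (keep a few initial "sink" tokens plus a recent window) operating on token indices rather than real tensors; `evict`, `n_sink`, and `window` are illustrative names, not from any library:

```python
# Toy StreamingLLM-style eviction over cache positions (indices, not tensors):
# always retain the first n_sink tokens plus the most recent `window` tokens,
# dropping everything in between.
def evict(cache_positions: list[int], n_sink: int = 4, window: int = 8) -> list[int]:
    sinks = cache_positions[:n_sink]          # attention-sink tokens, never evicted
    recent = cache_positions[n_sink:][-window:]  # sliding window of newest tokens
    return sinks + recent

positions = list(range(20))   # 20 tokens generated so far
kept = evict(positions)
print(kept)  # → [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Score-based methods like H2O differ only in the middle step: instead of dropping everything between the sinks and the window, they rank those positions by accumulated attention scores and keep the heaviest hitters.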