# Grouped Query Attention (GQA)

Parent: [[Efficient Transformer Architectures for Edge]]

An attention variant that sits between multi-head attention (MHA) and multi-query attention (MQA). In MHA, every query head has its own key and value head. In MQA, all query heads share a single key-value head, which is memory-efficient but degrades quality. GQA compromises: query heads are partitioned into groups, and each group shares a single set of key-value heads.

The motivation is entirely about the KV cache. During autoregressive generation, every subsequent token attends to all prior tokens, so the keys and values of every prior token must be kept in memory: the KV cache. That cache scales with sequence length, batch size, and the number of key-value heads. For a 70B model at long context, the KV cache can dwarf the model weights in memory.

GQA typically uses 8 key-value groups for a model that would otherwise have 64 query heads, shrinking the KV cache by 8x. Quality stays within a whisker of full MHA. Inference latency and throughput improve substantially because less memory bandwidth is consumed per token.

GQA has become the default in most modern open-weight models (Llama 2/3, Mistral, and many others). It is one of the clearest examples of an architectural choice driven entirely by inference hardware reality rather than by training loss.

## Related

- [[KV Cache Compression and Eviction]]
- [[FlashAttention and Memory-Bound Attention]]

---
Tags: #ai #transformers #edge #kp
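A quick sketch of the cache arithmetic. The shapes below (80 layers, 64 query heads, head dim 128, 8 KV groups) are Llama-2-70B-like assumptions, and `kv_cache_bytes` is an illustrative helper, not any library's API:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Two tensors (K and V) per layer, each shaped
    [batch, n_kv_heads, seq_len, head_dim], at the given precision (fp16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Llama-2-70B-like shapes at 4k context, batch 1, fp16:
mha = kv_cache_bytes(80, 64, 128, seq_len=4096, batch=1)  # 64 KV heads (full MHA)
gqa = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=1)   # 8 KV groups (GQA)

print(f"MHA: {mha / 2**30:.2f} GiB")  # 10.00 GiB
print(f"GQA: {gqa / 2**30:.2f} GiB")  # 1.25 GiB
print(f"reduction: {mha // gqa}x")    # 8x

# The grouping itself is just integer division: with 64 query heads and
# 8 KV heads, each block of 64 // 8 = 8 query heads reads the same KV head.
kv_head_of = lambda q_head: q_head // (64 // 8)
assert kv_head_of(0) == 0 and kv_head_of(7) == 0 and kv_head_of(63) == 7
```

Note the cache is linear in `n_kv_heads` but independent of the number of *query* heads, which is exactly why GQA shrinks it without touching the rest of the architecture.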