# FlashAttention and Memory-Bound Attention

Parent: [[Efficient Transformer Architectures for Edge]]

A reimplementation of the standard attention operation that recognises the same architectural truth everything else on the edge has to grapple with: attention is memory-bound, not compute-bound.

The naive implementation of softmax attention materialises a sequence-length-by-sequence-length score matrix in GPU HBM. For long contexts, that matrix becomes the bottleneck: not the cost of the matrix multiplications themselves, but the cost of shuttling the scores into and out of memory.

FlashAttention, introduced by Dao et al., reorganises the computation so that the attention matrix is never fully materialised. The sequence is processed in tiles small enough to fit in the GPU's on-chip SRAM, and the softmax normalisation is computed incrementally with an online algorithm that maintains a running maximum and running sum per query row. The result is mathematically identical to standard attention but consumes dramatically less memory bandwidth per token.

The practical impact has been enormous. FlashAttention is the default attention kernel in virtually every serious inference framework. FlashAttention-2 pushed the optimisation further by rebalancing work across warps and thread blocks; FlashAttention-3, tuned for Hopper GPUs, exploits asynchronous tensor core operations.

The broader lesson: on modern hardware, compute is cheap and moving data is expensive. The best kernels are the ones that minimise data movement, even if they do more arithmetic. This principle applies far beyond attention. It is why structured sparsity beats unstructured sparsity, why quantisation pays off, and why the whole efficient-transformer design discipline exists.

## Related

- [[Grouped Query Attention (GQA)]]
- [[KV Cache Compression and Eviction]]
- [[von Neumann Bottleneck]]

---
Tags: #ai #transformers #hardware #kp
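
---

The tiled online-softmax idea described above can be sketched in NumPy. This is illustrative only: the real kernel stages tiles through SRAM on the GPU, and the function name and tile size here are my own choices, not part of any library API. The point is that each key/value tile is visited once, and the running maximum `m` and normaliser `l` let earlier partial results be rescaled so the full score matrix never exists.

```python
import numpy as np

def tiled_attention(Q, K, V, tile_size=64):
    """Attention via online softmax over key/value tiles.

    Never materialises the full (N x N) score matrix; instead keeps,
    per query row, a running max `m`, running normaliser `l`, and a
    partially accumulated output `O`, rescaling them as new tiles arrive.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max of scores seen so far
    l = np.zeros(N)           # running softmax normaliser

    for j in range(0, K.shape[0], tile_size):
        Kj = K[j:j + tile_size]          # one key tile
        Vj = V[j:j + tile_size]          # matching value tile
        S = (Q @ Kj.T) * scale           # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)        # rescale factor for old accumulators
        P = np.exp(S - m_new[:, None])   # tile-local unnormalised softmax
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new

    return O / l[:, None]
```

Because the rescaling by `alpha` is exact, the output matches naive softmax attention to floating-point precision, regardless of tile size.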