# State-Space Models (Mamba)
Parent: [[Efficient Transformer Architectures for Edge]]
A non-attention architecture for sequence modelling that has quietly become the most credible challenger to transformers for long-context workloads. Mamba, introduced by Gu and Dao, is built on structured state-space models — a formalism borrowed from control theory — with a critical innovation: the state transitions depend on the input rather than being fixed, a mechanism the authors call selection.
The advantage is architectural. Attention has quadratic cost in sequence length because every token attends to every other token. State-space models have linear cost: they maintain a fixed-size hidden state that is updated recurrently as new tokens arrive. At long context, this becomes the difference between running the model and not. A 128K context transformer is a feat of engineering; a 128K context Mamba is just a Mamba.
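The recurrent update described above can be sketched in a few lines. This is an illustrative toy, not Mamba's actual implementation: the parameter names (`Wb`, `Wc`, `Wdt`), the scalar input stream, and the simplified discretisation are all assumptions made for brevity. The point it demonstrates is structural: the state `h` has fixed size, so time is linear in sequence length and memory is constant, and the step size and input projection are computed from the current input, which is the "selective" part.

```python
import numpy as np

def selective_scan(x, A, Wb, Wc, Wdt):
    """Toy selective SSM over a scalar input stream (hypothetical, simplified)."""
    n = A.shape[0]
    h = np.zeros(n)                  # fixed-size state: O(n) memory regardless of length
    out = np.empty_like(x)
    for t, u in enumerate(x):        # O(seq_len) time: one recurrent update per token
        dt = np.log1p(np.exp(Wdt * u))       # input-dependent step size (softplus)
        B = Wb * u                           # input-dependent input projection
        h = np.exp(dt * A) * h + dt * B * u  # ZOH-discretised update of h' = A h + B u
        out[t] = Wc @ h                      # read out from the compressed state
    return out
```

Because each step only touches `h`, generation at position 100,000 costs exactly what it costs at position 10 — there is no growing cache to read.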
The cost of the trick is that the fixed-size hidden state has to compress all of the past. Attention can look back at any specific prior token exactly; Mamba can only consult whatever survives in its compressed state. For tasks that need long-range exact retrieval — finding a specific number in a long document, for instance — pure Mamba struggles. Hybrid architectures that interleave attention and state-space blocks have emerged to get the best of both.
For edge deployment, Mamba is interesting because its memory footprint during generation does not grow with sequence length. The KV cache problem simply does not exist. This matters a great deal on memory-constrained hardware.
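Some rough arithmetic makes the contrast concrete. The configuration numbers below (layer count, head and state dimensions, fp16 storage) are illustrative assumptions, not any particular model's specs; what matters is the shape of the two formulas, one linear in context length and one constant.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V tensors per layer; grows linearly with context length
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def mamba_state_bytes(n_layers=32, d_inner=4096, d_state=16, dtype_bytes=2):
    # Recurrent state per layer; independent of context length
    return n_layers * d_inner * d_state * dtype_bytes

for ctx in (4_096, 131_072):
    print(f"{ctx:>7} tokens: KV cache {kv_cache_bytes(ctx) / 2**20:>8.0f} MiB, "
          f"SSM state {mamba_state_bytes() / 2**20:.0f} MiB")
```

Under these assumptions the KV cache reaches 16 GiB at 128K context while the recurrent state stays at 4 MiB — which is why the trade-off is decisive on memory-constrained hardware.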
Whether Mamba displaces attention, coexists with it, or remains a niche option is still being settled.
## Related
- [[Grouped Query Attention (GQA)]]
- [[KV Cache Compression and Eviction]]
- [[FlashAttention and Memory-Bound Attention]]
---
Tags: #ai #transformers #mamba #kp