# Efficient Transformer Architectures for Edge
Parent: [[Model Compression & Edge AI MOC]]
The bridge between software compression and hardware reality. On-device deployment imposes a set of hard constraints: memory bandwidth, cache size, thermal envelope, latency budget. The architectures that win on the edge are not the ones with the best loss — they are the ones that compose well with the memory hierarchy of the target chip.
Every design choice in an efficient transformer is a hardware decision in disguise. Grouped query attention is a KV cache decision. Sub-4-bit quantisation is a memory-bandwidth decision. Speculative decoding is a latency decision masquerading as a sampling trick.
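To make the "GQA is a KV cache decision" claim concrete, here is a back-of-envelope sizing sketch. All model dimensions (layers, heads, head dim, context) are illustrative placeholders, not tied to any real model:

```python
# Hypothetical sizing sketch: how grouped-query attention shrinks the KV cache.
# The cache stores one K and one V tensor per layer, sized by KV heads, not query heads.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total KV cache size: 2 tensors (K and V) per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# An illustrative 32-layer model, 32 query heads of dim 128, fp16 cache, 8k context:
mha = kv_cache_bytes(32, 32, 128, 8192)   # multi-head: one KV head per query head
gqa = kv_cache_bytes(32, 8, 128, 8192)    # grouped-query: 8 shared KV heads
mqa = kv_cache_bytes(32, 1, 128, 8192)    # multi-query: a single shared KV head

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB, MQA: {mqa / 2**30:.3f} GiB")
# → MHA: 4.00 GiB, GQA: 1.00 GiB, MQA: 0.125 GiB
```

The cache scales linearly in the number of KV heads, so GQA with 8 of 32 heads is a straight 4x reduction — which is exactly the kind of number that decides whether a model fits in an edge device's DRAM at all.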
## Key Concepts
- [[Grouped Query Attention (GQA)]] and [[Multi-Query Attention (MQA)]] — attention variants that shrink the KV cache
- [[KV Cache Compression and Eviction]]
- [[Sliding Window Attention]] and [[Streaming Attention]]
- [[Structured vs Unstructured Pruning]] — only the structured kind gives real hardware speedups
- [[Sub-4-Bit Quantisation Failure Modes]] — why the curve cliffs
- [[FlashAttention and Memory-Bound Attention]]
- [[Speculative Decoding]] — architectural tricks that hide latency
- [[State-Space Models (Mamba)]] — a non-attention alternative with a different hardware profile
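The GQA/MQA entries above come down to one mechanical change: several query heads read the same cached K/V head. A minimal single-token decode sketch, assuming `n_q_heads` is a multiple of `n_kv_heads` (all names and shapes are illustrative):

```python
import numpy as np

def gqa_decode_step(q, k_cache, v_cache):
    """q: (n_q_heads, d); k_cache, v_cache: (n_kv_heads, t, d).
    Each group of n_q_heads // n_kv_heads query heads attends over the
    same shared KV head -- the cache shrinks, the query width does not."""
    n_q, d = q.shape
    n_kv, t, _ = k_cache.shape
    group = n_q // n_kv
    out = np.empty((n_q, d))
    for h in range(n_q):
        kv = h // group                            # shared KV head for this query head
        scores = k_cache[kv] @ q[h] / np.sqrt(d)   # (t,) attention logits
        w = np.exp(scores - scores.max())
        w /= w.sum()                               # softmax over past positions
        out[h] = w @ v_cache[kv]                   # weighted sum of cached values
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((2, 32, 16))   # only 2 KV heads serve all 8 query heads
v = rng.standard_normal((2, 32, 16))
print(gqa_decode_step(q, k, v).shape)  # → (8, 16)
```

MQA is the `n_kv_heads == 1` special case; MHA is `n_kv_heads == n_q_heads`.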
## Key Questions
- Is the compression technique actually realisable on the target hardware, or only on paper?
- Does the architecture match the memory hierarchy (SRAM, DRAM, HBM) of the chip?
- What is the bottleneck — compute or memory bandwidth? For inference, it is almost always bandwidth.
- How does the KV cache grow with context length, and can it be compressed without quality loss?
- What is the cliff point for quantisation on this architecture? (For most, 4-bit is the floor before engineering heroics.)
- How does throughput change at batch size 1 (interactive) vs. large batch (batch inference)?
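The bandwidth question above has a simple roofline answer for batch-1 decode: if every weight must be streamed from memory once per generated token, bandwidth alone caps tokens/sec. A sketch with hypothetical hardware numbers:

```python
# Back-of-envelope roofline for batch-1 autoregressive decode, assuming the
# memory-bound regime (every weight read once per token; KV cache ignored).
# The 7B model and 100 GB/s bandwidth figures are illustrative, not measured.

def decode_tokens_per_sec(n_params, bytes_per_weight, mem_bandwidth_gbs):
    """Upper bound on batch-1 decode speed when weight reads dominate."""
    bytes_per_token = n_params * bytes_per_weight
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

fp16 = decode_tokens_per_sec(7e9, 2.0, 100)   # ~7 tok/s ceiling
int4 = decode_tokens_per_sec(7e9, 0.5, 100)   # ~29 tok/s: 4x from bandwidth alone
print(f"fp16: {fp16:.1f} tok/s, int4: {int4:.1f} tok/s")
```

This is why quantisation buys latency even when the chip has compute to spare, and why large-batch serving (which amortises each weight read across the batch) flips the bottleneck back toward compute.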
## Reading
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023)
- Dao et al., "FlashAttention" (2022) and "FlashAttention-2" (2023)
- Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (2023)
- Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)
- Any recent MLPerf inference benchmark report
---
Tags: #ai #edge #hardware #transformers #kp