# Efficient Transformer Architectures for Edge

Parent: [[Model Compression & Edge AI MOC]]

The bridge between software compression and hardware reality. On-device deployment is a set of hard constraints: memory bandwidth, cache size, thermal envelope, latency budget. The architectures that win on the edge are not the ones with the best loss — they are the ones that compose well with the memory hierarchy of the target chip.

Every design choice in an efficient transformer is a hardware decision in disguise. Grouped query attention is a KV cache decision. Sub-4-bit quantisation is a memory-bandwidth decision. Speculative decoding is a latency decision masquerading as a sampling trick. (Back-of-the-envelope sketches for these claims are at the bottom of this note.)

## Key Concepts

- [[Grouped Query Attention (GQA)]] and [[Multi-Query Attention (MQA)]] — attention variants that shrink the KV cache
- [[KV Cache Compression and Eviction]]
- [[Sliding Window Attention]] and [[Streaming Attention]]
- [[Structured vs Unstructured Pruning]] — only the structured kind gives real hardware speedups
- [[Sub-4-Bit Quantisation Failure Modes]] — why the curve cliffs
- [[FlashAttention and Memory-Bound Attention]]
- [[Speculative Decoding]] — architectural tricks that hide latency
- [[State-Space Models (Mamba)]] — a non-attention alternative with a different hardware profile

## Key Questions

- Is the compression technique actually realisable on the target hardware, or only on paper?
- Does the architecture match the memory hierarchy (SRAM, DRAM, HBM) of the chip?
- What is the bottleneck — compute or memory bandwidth? For inference, it is almost always bandwidth.
- How does the KV cache grow with context length, and can it be compressed without quality loss?
- What is the cliff point for quantisation on this architecture? (For most, 4-bit is the floor before engineering heroics.)
- How does throughput change at batch size 1 (interactive) vs. large batch (batch inference)?

## Reading

- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models" (2023)
- Dao et al., "FlashAttention" (2022) and "FlashAttention-2" (2023)
- Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (2023)
- Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)
- Any recent MLPerf inference benchmark report

---
Tags: #ai #edge #hardware #transformers #kp
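
## Back-of-the-Envelope Sketches

Why grouped query attention is a KV cache decision: the cache scales linearly with the number of KV heads, so sharing KV heads across query heads shrinks it proportionally. A minimal sketch; the dimensions below (32 layers, 32 query heads, head dim 128, fp16, 8k context) are illustrative 7B-class assumptions, not measurements of any particular checkpoint.

```python
# KV cache sizing: two tensors (K and V) per layer, one slot per token.
# All model dimensions here are illustrative assumptions, not a real checkpoint.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size in bytes for a single sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

layers, q_heads, head_dim, context = 32, 32, 128, 8192

configs = {
    "MHA (32 KV heads)": q_heads,  # every query head has its own KV head
    "GQA (8 KV heads)": 8,         # 8 KV heads shared across 32 query heads
    "MQA (1 KV head)": 1,          # a single KV head shared by all query heads
}
for name, kv_heads in configs.items():
    size = kv_cache_bytes(layers, kv_heads, head_dim, context)
    print(f"{name:<18} {size / 2**30:.2f} GiB at {context} tokens")
```

Under these assumptions the MHA cache is about 4 GiB per sequence at 8k tokens; 8 KV heads cut it to 1 GiB and MQA to 128 MiB, which is often the difference between the cache fitting in an edge device's memory budget and not.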
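
The usual roofline argument behind "for inference, it is almost always bandwidth": at batch size 1, every decoded token streams the full weight set plus the KV cache from memory while doing only about 2 FLOPs per weight, so the bandwidth floor dominates the compute floor. The hardware numbers below are placeholder assumptions; substitute the specs of the target chip.

```python
# Roofline-style check: is batch-1 decode compute-bound or bandwidth-bound?
# All hardware numbers are placeholder assumptions; plug in the target chip's specs.

params          = 7e9        # model parameters
bytes_per_param = 0.5        # ~4-bit weights
kv_bytes        = 1 * 2**30  # KV cache size, e.g. the GQA figure from the sketch above
mem_bw          = 100e9      # DRAM bandwidth in bytes/s (placeholder edge SoC)
peak_flops      = 4e12       # sustained FLOP/s of the NPU/GPU (placeholder)

# Per decoded token: read every weight and the KV cache once, ~2 FLOPs per weight.
bytes_moved = params * bytes_per_param + kv_bytes
flops = 2 * params

t_mem = bytes_moved / mem_bw      # latency floor if purely bandwidth-limited
t_compute = flops / peak_flops    # latency floor if purely compute-limited

print(f"bandwidth-bound floor: {t_mem * 1e3:5.1f} ms/token")
print(f"compute-bound floor:   {t_compute * 1e3:5.1f} ms/token")
print("bottleneck:", "memory bandwidth" if t_mem > t_compute else "compute")
```

With these placeholder numbers the bandwidth floor (~46 ms/token) is more than ten times the compute floor (~3.5 ms/token), which is why weight quantisation and KV cache compression buy real latency even while the chip's FLOPs sit idle.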
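
Why speculative decoding is a latency decision rather than a sampling trick: a cheap draft model proposes several tokens and the target model verifies them in a single forward pass, so the expected number of tokens emitted per target pass rises above one. The closed form below is from Leviathan et al. (2023), with acceptance rate alpha and draft length gamma; the sketch ignores the draft model's own cost.

```python
# Expected tokens emitted per target-model pass in speculative decoding
# (Leviathan et al., 2023): E = (1 - alpha**(gamma + 1)) / (1 - alpha).

def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected number of tokens emitted per target-model forward pass."""
    if alpha >= 1.0:
        return gamma + 1.0  # every draft token is accepted
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

for alpha in (0.6, 0.8, 0.9):
    row = ", ".join(
        f"gamma={gamma}: {expected_tokens_per_pass(alpha, gamma):.2f}"
        for gamma in (2, 4, 8)
    )
    print(f"alpha={alpha:.1f} -> {row} tokens/pass")
```

Because a memory-bound target pass costs roughly the same whether it verifies one token or eight (the weights dominate the bytes moved), anything above one token per pass is a direct latency win, provided the draft model stays cheap.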