Prefill and decode are the two phases of LLM [[Inference]], each with very different computational characteristics.

**Prefill** processes the input prompt. The model reads all input tokens in parallel, building the key-value (KV) cache. This phase is compute-bound and highly parallelizable: you're doing large matrix multiplications across many tokens at once.

**Decode** generates output tokens one at a time. Each new token depends on all previous tokens, so this phase is sequential and memory-bound. Every step re-reads the model weights and the KV cache just to do a single token's worth of computation (a toy sketch of both phases sits at the end of this note).

This is why inference serving is disaggregating. Prefill wants high compute density and can tolerate higher memory latency because you're batching across tokens. Decode wants ultra-high [[Memory Bandwidth]] because it is memory-limited, with little parallelism to hide latency (a roofline back-of-envelope also follows below).

The optimal architectures differ accordingly. Prefill benefits from high-capacity memory ([[GDDR]]) to fit massive [[Context Window]] prompts. Decode benefits from high-bandwidth memory ([[SRAM]]) to minimize per-token latency. Nvidia's three Rubin variants target exactly this split.

The economics flip too. Prefill can run at high [[Batch Size]] to amortize costs. Decode often runs at low batch sizes for latency-sensitive applications, making per-token costs higher, but users are willing to pay for speed.

---
#deeptech #inference #firstprinciple

Related: [[Inference]] | [[Memory Bandwidth]] | [[SRAM]] | [[GDDR]] | [[Context Window]] | [[Batch Size]] | [[Nvidia-Groq - Inference Disaggregation Play]]
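
To make the asymmetry concrete, here is a minimal single-head attention sketch in Python/NumPy. All names, shapes, and the 512-token prompt are assumptions for illustration, not any real model or framework API: prefill touches every prompt token in one batched pass and returns the KV cache, while each decode step does one token's worth of math but must re-read the entire cache.

```python
# Toy single-head attention illustrating prefill vs decode.
# Everything here (head dim, prompt length, weights) is an illustrative assumption.
import numpy as np

d = 64                                   # head dimension (assumed)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefill(prompt_emb):
    """Process all T prompt tokens at once: big matmuls, builds the KV cache."""
    Q = prompt_emb @ Wq                  # (T, d) computed for all tokens in parallel
    K = prompt_emb @ Wk                  # (T, d)
    V = prompt_emb @ Wv                  # (T, d)
    scores = Q @ K.T / np.sqrt(d)        # (T, T): compute scales with T^2
    mask = np.triu(np.ones_like(scores), k=1) * -1e9   # causal mask
    out = softmax(scores + mask) @ V
    return out, (K, V)                   # the cache is carried into decode

def decode_step(new_emb, kv_cache):
    """Generate one token: tiny matmuls, but reads the whole cache plus weights."""
    K, V = kv_cache
    q = new_emb @ Wq                     # (1, d): work for a single token
    k = new_emb @ Wk
    v = new_emb @ Wv
    K, V = np.vstack([K, k]), np.vstack([V, v])  # cache grows by one row per step
    scores = q @ K.T / np.sqrt(d)        # (1, T+1): touches every cached key
    out = softmax(scores) @ V
    return out, (K, V)

prompt = rng.standard_normal((512, d))   # 512-token prompt (assumed)
_, cache = prefill(prompt)               # one parallel pass over the prompt
tok = rng.standard_normal((1, d))
for _ in range(16):                      # 16 strictly sequential decode steps
    tok, cache = decode_step(tok, cache) # feeding output back just keeps the toy loop self-contained
```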
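
The memory-bound claim can be checked with simple roofline arithmetic. The sketch below uses assumed hardware and model numbers (a 70B-parameter model in fp16 on an accelerator with roughly 1 PFLOP/s and 3.35 TB/s of memory bandwidth) and counts about 2 FLOPs per weight per token against one pass of weight traffic: decode at one token per pass lands near 1 FLOP per byte, far below the roughly 300 FLOP/byte needed to keep the compute units busy, while a batched prefill pass over hundreds of tokens crosses into compute-bound territory.

```python
# Back-of-envelope roofline: why decode is memory-bound and batched prefill is not.
# All numbers below are illustrative assumptions, not a specific GPU's spec sheet.
# KV-cache and activation traffic are ignored for simplicity.

params        = 70e9          # model parameters (assumed 70B)
bytes_per_w   = 2             # fp16/bf16 weights
peak_flops    = 1.0e15        # ~1 PFLOP/s dense fp16 (assumed accelerator)
mem_bandwidth = 3.35e12       # ~3.35 TB/s (assumed HBM bandwidth)

def arithmetic_intensity(tokens_per_pass):
    """FLOPs per byte of weight traffic for one forward pass over N tokens."""
    flops = 2 * params * tokens_per_pass     # ~2 FLOPs per weight per token
    bytes_moved = params * bytes_per_w       # weights are read once per pass
    return flops / bytes_moved

ridge = peak_flops / mem_bandwidth           # intensity needed to saturate compute

for t in (1, 8, 512, 4096):                  # decode (1 token) vs batched prefill
    ai = arithmetic_intensity(t)
    bound = "compute-bound" if ai >= ridge else "memory-bound"
    print(f"{t:>5} tokens/pass: {ai:7.1f} FLOP/byte  ({bound}; ridge ~ {ridge:.0f})")
```

With these assumptions, arithmetic intensity works out to roughly one FLOP per byte per token in the pass, so decode sits around 1 while a 512-token prefill pass clears the ~300 FLOP/byte ridge point. That gap is the whole disaggregation argument in one number.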