The context window is the amount of text an LLM can process at once, measured in tokens.
Think of it as the model's working memory. GPT-4 Turbo has a 128k-token context window, roughly 96,000 words. Claude 3.5 Sonnet goes to 200k tokens. Gemini 1.5 Pro reaches 1 million tokens.
Bigger context windows let models handle longer documents, maintain conversation history, and reason over more information simultaneously. The tradeoff: KV-cache memory grows linearly with context length, while attention computation grows quadratically with it.
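A back-of-envelope sketch of that scaling, using assumed dimensions roughly in the range of a 70B-parameter model (layer count, head count, head size, and fp16 precision are illustrative, not any specific model's spec):

```python
# KV-cache bytes grow linearly with context length; prefill attention FLOPs
# grow quadratically, because every token attends to every earlier token.
# All model dimensions below are assumed for illustration.

def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values; fp16/bf16 elements assumed
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

def prefill_attention_flops(context_len, n_layers=80, n_heads=64, head_dim=128):
    # QK^T and attention*V each cost ~context_len^2 * head_dim multiply-adds per head
    return 2 * 2 * n_layers * n_heads * head_dim * context_len ** 2

for ctx in (8_000, 128_000, 1_000_000):
    print(f"{ctx:>9} tokens: "
          f"KV cache ≈ {kv_cache_bytes(ctx) / 1e9:7.1f} GB, "
          f"attention ≈ {prefill_attention_flops(ctx) / 1e15:8.1f} PFLOPs")
```

Going from 128k to 1M tokens multiplies the KV cache by ~8x but the attention work by ~60x, which is the core of the tradeoff.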
For [[Inference]], the context window creates distinct architectural requirements during [[Prefill and Decode]]. During prefill, you need enough memory capacity to hold the key-value cache for the entire context. This is why Nvidia's Rubin CPX uses high-capacity [[GDDR]] memory. You're optimizing for fitting massive contexts, not for speed; see the capacity check below.
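A minimal capacity check in the same spirit, assuming ~140 GB of fp16 weights and the per-token KV-cache size from the sketch above (both illustrative numbers, not vendor specs):

```python
# Prefill capacity sketch: weights plus the full KV cache must fit in the
# accelerator's memory pool. Weight size and KV bytes/token are assumed.

def fits_in_memory(context_len, pool_gb, weight_bytes=140e9, kv_bytes_per_token=327_680):
    return weight_bytes + kv_bytes_per_token * context_len <= pool_gb * 1e9

for ctx in (8_000, 128_000, 1_000_000):
    print(f"{ctx:>9} tokens fits in a 288 GB pool: {fits_in_memory(ctx, 288)}")
```

Under these assumptions a million-token context blows past even a large single-device pool, which is the argument for dedicated high-capacity prefill hardware.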
During decode, the context is already loaded. Now you're generating tokens one by one while referencing that stored context. Here, [[Memory Bandwidth]] matters more than capacity because you're constantly fetching from the key-value cache.
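A rough decode-side bound, assuming an HBM3e-class bandwidth figure and the same illustrative weight and KV-cache sizes as above; the point is that per-token bandwidth demand grows with context length:

```python
# Decode-phase sketch (assumed numbers, not vendor specs): each generated token
# re-reads the model weights plus the KV cache accumulated so far, so memory
# bandwidth sets a ceiling on tokens/sec that tightens as the context grows.

def decode_tokens_per_sec(context_len, bandwidth_gb_per_s=3350,   # HBM3e-class, assumed
                          weight_bytes=140e9,                      # ~70B params in fp16, assumed
                          kv_bytes_per_token=327_680):             # from the KV-cache sketch above
    bytes_per_step = weight_bytes + kv_bytes_per_token * context_len
    return bandwidth_gb_per_s * 1e9 / bytes_per_step

for ctx in (8_000, 128_000, 1_000_000):
    print(f"{ctx:>9} tokens of context: ≤ {decode_tokens_per_sec(ctx):5.1f} tokens/s (batch size 1)")
```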
The future: models with million-token contexts will need specialized prefill accelerators with huge memory pools, separate from decode accelerators optimized for low-latency generation.
---
#deeptech #inference #firstprinciple
Related: [[Inference]] | [[Prefill and Decode]] | [[GDDR]] | [[Memory Bandwidth]] | [[Nvidia-Groq - Inference Disaggregation Play]]