Batch size is how many requests you process simultaneously on the same hardware. Higher batch sizes mean better GPU utilization and lower cost per request: you're amortizing the fixed cost of loading model weights across many requests. This is why inference providers optimize for batching.

The tradeoff: batching increases latency. If you're processing 100 requests together, some requests wait for others to finish. For many applications this is fine. For real-time [[Agentic AI]] reasoning or conversational UI, waiting is unacceptable.

This creates a fundamental split in [[Inference]] economics. High-batch workloads (background processing, batch API calls) optimize for throughput per dollar. Low-batch workloads (real-time chat, agentic reasoning) optimize for latency per request.

[[SRAM]] architectures excel at low-batch inference. They can't batch as effectively as GPU/DRAM systems because SRAM capacity is limited, but they deliver drastically lower per-token latency for individual requests. The market signal from Cerebras and Groq: **users are willing to pay for speed** even at a higher per-token cost.

Traditional GPU inference systems optimize for high-batch throughput. They're economically optimal when you can aggregate enough requests to keep the GPU busy, and [[Memory Bandwidth]] becomes less critical because the cost of streaming weights is spread across many requests.

---

#deeptech #inference #firstprinciple

Related: [[Inference]] | [[SRAM]] | [[Memory Bandwidth]] | [[Prefill and Decode]] | [[Nvidia-Groq - Inference Disaggregation Play]]
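
A minimal back-of-the-envelope sketch of the amortization argument, assuming decode is purely memory-bandwidth-bound and ignoring KV-cache traffic, compute limits, prefill, and queueing delay (which is where the real latency cost of batching shows up). The weight size, bandwidth, and dollar figures below are illustrative assumptions, not measurements.

```python
# Illustrative model of batch-size economics for memory-bandwidth-bound decode.
# All constants are assumptions for the sketch, not vendor specs or benchmarks.

WEIGHT_BYTES = 140e9      # assumed: ~70B-parameter model at FP16 ≈ 140 GB of weights
MEM_BW = 3.35e12          # assumed: ~3.35 TB/s of HBM bandwidth
GPU_COST_PER_HR = 4.0     # assumed: $/hour for the accelerator

def decode_step_seconds(batch_size: int) -> float:
    """One decode step streams the full weights once regardless of batch size,
    so step time is roughly constant while decode stays memory-bandwidth-bound."""
    return WEIGHT_BYTES / MEM_BW

def tokens_per_second(batch_size: int) -> float:
    # Each step emits one token per request in the batch.
    return batch_size / decode_step_seconds(batch_size)

def cost_per_million_tokens(batch_size: int) -> float:
    cost_per_second = GPU_COST_PER_HR / 3600
    return cost_per_second / tokens_per_second(batch_size) * 1e6

for b in (1, 8, 64):
    print(f"batch={b:3d}  "
          f"per-token latency ≈ {decode_step_seconds(b) * 1000:.1f} ms  "
          f"throughput ≈ {tokens_per_second(b):,.0f} tok/s  "
          f"cost ≈ ${cost_per_million_tokens(b):.2f}/M tokens")
```

In this simplified model, per-token latency stays flat while cost per token falls roughly linearly with batch size, which is the amortization effect. In practice, growing KV-cache reads and scheduling delay erode both as batches get large, which is the latency cost described above.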