Memory bandwidth is how fast data moves between memory and compute, measured in GB/s (gigabytes per second). For many AI workloads, it's the actual bottleneck. Think of it this way: you can have the fastest processor in the world, but if you can't feed it data quickly enough, it sits idle. Modern AI accelerators need hundreds to thousands of GB/s to stay fed.

The memory hierarchy matters: [[SRAM]] has the highest bandwidth but the lowest capacity; [[HBM]] balances bandwidth and capacity; [[GDDR]] gives up peak bandwidth in exchange for cheaper capacity. Each optimizes for a different workload pattern.

For [[Inference]] specifically, bandwidth requirements depend on the phase. During prefill, you're processing a large context in parallel, so you can batch effectively and hide memory latency. During decode, you're generating one token at a time, often at small batch sizes. Here memory bandwidth becomes critical: every decode step has to stream essentially all of the model weights (plus the KV cache) from memory, with little parallelism to hide that latency, so per-token latency is bounded below by roughly model bytes divided by bandwidth.

This is why [[SRAM]]-based architectures excel at decode: they can deliver 10-100x more bandwidth than DRAM-based systems, dramatically reducing per-token latency for low-batch workloads.
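A quick back-of-envelope sketch of that decode bound, assuming decode is purely bandwidth-bound (one full pass over the weights per generated token, ignoring KV-cache traffic, compute time, and batching); the helper function and the bandwidth figures are illustrative assumptions, not measurements:

```python
# Lower-bound estimate of per-token decode latency when memory-bandwidth-bound.
# Assumption: each generated token streams all model weights from memory once;
# KV-cache traffic, compute time, and batching effects are ignored.

def decode_latency_ms(params_billion: float, bytes_per_param: float,
                      bandwidth_gb_s: float) -> float:
    """Approximate floor on per-token latency, in milliseconds."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3

# Illustrative example: a 70B-parameter model in FP16 (2 bytes per parameter)
# on memory systems with ballpark bandwidths (assumed, not vendor specs).
for label, bw in [("GDDR-class, ~1,000 GB/s", 1_000),
                  ("HBM-class,  ~3,000 GB/s", 3_000),
                  ("SRAM-class, ~80,000 GB/s", 80_000)]:
    print(f"{label}: >= {decode_latency_ms(70, 2, bw):.2f} ms/token")
```

At batch size 1, the reciprocal of that floor is the best-case tokens per second, which is why a 10-100x bandwidth gap shows up almost directly in per-token decode latency.

---

#deeptech #firstprinciple

Related: [[SRAM]] | [[HBM]] | [[GDDR]] | [[Batch Size]] | [[Prefill and Decode]] | [[Nvidia-Groq - Inference Disaggregation Play]]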