HBM is High Bandwidth Memory: stacked DRAM that sits in the sweet spot between commodity DDR capacity and [[SRAM]] speed for AI workloads.
HBM stacks DRAM dies vertically and uses through-silicon vias (TSVs) to expose a 1024-bit interface per stack, far wider than the 32-to-64-bit buses of GDDR and DDR. This delivers roughly 10-20x more bandwidth than standard DRAM while maintaining reasonable capacity and cost. Think of it as the workhorse memory for modern AI accelerators.
The 3D stacking also puts memory physically closer to the processor, cutting interconnect latency and power per bit. HBM3 delivers roughly 800 GB/s per stack, and HBM3E pushes past 1 TB/s.
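A quick back-of-envelope check of those per-stack numbers (pin rates are approximate published figures, and per-accelerator totals depend on how many stacks a vendor mounts):

```python
def stack_bandwidth_gbs(pin_rate_gbps: float, bus_width_bits: int = 1024) -> float:
    """Peak bandwidth of one HBM stack in GB/s (per-pin rate x bus width / 8)."""
    return pin_rate_gbps * bus_width_bits / 8

# Approximate per-pin data rates by generation (Gb/s); exact speeds vary by vendor bin.
generations = {"HBM2E": 3.6, "HBM3": 6.4, "HBM3E": 9.6}

for name, pin_rate in generations.items():
    per_stack = stack_bandwidth_gbs(pin_rate)
    # Accelerators typically mount 4-8 stacks; 6 is shown purely as an illustration.
    print(f"{name}: ~{per_stack:.0f} GB/s per stack, ~{6 * per_stack / 1000:.1f} TB/s across 6 stacks")
```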
For training and high-density batched inference, HBM strikes the right balance: enough [[Memory Bandwidth]] to keep compute units fed, enough capacity for large models and batches, and manageable cost per GB. This is why Nvidia's data-center GPUs (A100, H100) use HBM.
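One way to see what "keeping compute units fed" means: for batch-1 decoding, every generated token has to stream all the model weights from HBM at least once, so bandwidth alone sets a latency floor. The model size and bandwidth below are illustrative assumptions, not specs of any particular chip:

```python
def min_ms_per_token(params_billions: float, bytes_per_param: float,
                     hbm_bandwidth_tbs: float) -> float:
    """Lower bound on decode latency if weight streaming is the only cost."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    seconds = weight_bytes / (hbm_bandwidth_tbs * 1e12)
    return seconds * 1e3

# e.g. a 70B-parameter model in fp16 on an accelerator with ~3.3 TB/s of HBM
ms = min_ms_per_token(70, 2.0, 3.3)
print(f"~{ms:.0f} ms/token floor -> ~{1000 / ms:.0f} tokens/s, ignoring KV cache and compute")
```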
The tradeoff versus [[SRAM]]: lower bandwidth but far higher capacity. The tradeoff versus [[GDDR]]: higher bandwidth, better power efficiency, and more capacity per device, but much higher cost per GB and packaging complexity (the stacks sit on a silicon interposer next to the processor).
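A rough per-device comparison makes the ordering concrete (approximate public figures for representative parts; the ranking, not the exact numbers, is the point):

```python
# Illustrative per-device bandwidth/capacity, approximate public figures.
memories = [
    # (memory type, ~bandwidth TB/s, ~capacity GB)
    ("On-chip SRAM (e.g. Groq LPU)", 80.0, 0.23),
    ("HBM3 (e.g. H100 SXM)",          3.35, 80),
    ("GDDR6X (e.g. RTX 4090)",        1.0,  24),
]

for name, bw, cap in memories:
    print(f"{name:32s} ~{bw:>6.2f} TB/s  ~{cap:>6.2f} GB")
```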
---
#deeptech #firstprinciple
Related: [[SRAM]] | [[GDDR]] | [[Memory Bandwidth]] | [[Nvidia-Groq - Inference Disaggregation Play]]