SRAM (Static Random-Access Memory) is the fastest type of memory, but also the most expensive and the lowest in capacity.
Unlike DRAM, which stores each bit as charge on a capacitor and needs constant refreshing to retain it, SRAM holds each bit in a flip-flop (typically six transistors) for as long as power is supplied. No refresh cycles means faster, more predictable read/write access. The tradeoff: six transistors per bit take far more die area than DRAM's one transistor and one capacitor, making SRAM impractical for large memory capacities.
In AI inference, SRAM's speed advantage matters when memory bandwidth is the bottleneck. Decode in [[Inference]] generates one token at a time, and each step must stream the model weights (plus KV cache) from memory, so at low batch sizes the compute units mostly sit waiting on memory. SRAM's ability to feed data to compute units faster than DRAM directly cuts per-token latency; a back-of-envelope sketch below.
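A rough sanity check of the bandwidth-bound regime: per-token decode latency is approximately model bytes divided by memory bandwidth. The model size and both bandwidth figures below are illustrative assumptions, not measured vendor specs.

```python
# Back-of-envelope: decode throughput when memory bandwidth is the bottleneck.
# All numbers are illustrative assumptions, not vendor specs.

def tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound decode: each token requires streaming all weights once."""
    return bandwidth_bytes_per_s / model_bytes

MODEL_BYTES = 70e9     # assume a 70B-parameter model at 8-bit weights (~70 GB)
DRAM_BW     = 3.35e12  # assume ~3.35 TB/s (HBM3-class DRAM)
SRAM_BW     = 80e12    # assume ~80 TB/s (aggregate on-chip SRAM across chips)

for name, bw in [("DRAM (HBM3)", DRAM_BW), ("SRAM", SRAM_BW)]:
    tps = tokens_per_second(MODEL_BYTES, bw)
    print(f"{name:12s} ~{tps:7.0f} tok/s  ({1000 / tps:.2f} ms/token)")
```

Under these assumptions the DRAM system tops out near ~48 tok/s per stream while the SRAM system exceeds 1,000, purely from the bandwidth ratio.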
The economics flip based on workload. High [[Batch Size]] workloads amortize slower memory across many requests: one pass over the weights serves the whole batch. Low-batch, latency-sensitive workloads (like [[Agentic AI]] reasoning loops) benefit from SRAM's raw speed despite its much higher cost per bit of capacity.
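A toy model of that flip, with loudly hypothetical numbers (prices, bandwidths, and FLOP rates are placeholders chosen only to show the crossover): each decode step is bound by either streaming the weights or doing the matmuls, and batch size only helps the compute side.

```python
# Toy cost model of the batch-size flip. Every price, bandwidth, and FLOP
# rate here is a made-up placeholder, not a real spec.

def tokens_per_s(batch: int, bw: float, flops: float,
                 model_bytes: float = 70e9, params: float = 70e9) -> float:
    t_mem = model_bytes / bw                 # stream all weights once per step
    t_compute = 2 * params * batch / flops   # ~2 FLOPs per parameter per token
    return batch / max(t_mem, t_compute)     # whichever resource binds

def usd_per_mtok(usd_per_hour: float, tps: float) -> float:
    return usd_per_hour / 3600 / tps * 1e6

for batch in (1, 16, 64, 256):
    dram = usd_per_mtok(2.0,  tokens_per_s(batch, bw=3.35e12, flops=1e15))  # DRAM/GPU-like
    sram = usd_per_mtok(20.0, tokens_per_s(batch, bw=80e12,  flops=5e15))   # SRAM-heavy
    print(f"batch {batch:3d}:  DRAM ${dram:8.3f}/Mtok   SRAM ${sram:8.3f}/Mtok")
```

With these placeholder numbers, SRAM is cheaper per token at low batch, but once both systems go compute-bound the DRAM system's better $/FLOP wins (crossover somewhere between batch 64 and 256). The point is the shape of the curve, not the specific values.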
---
#deeptech #firstprinciple
Related: [[HBM]] | [[GDDR]] | [[Memory Bandwidth]] | [[Nvidia-Groq - Inference Disaggregation Play]]