# von Neumann Bottleneck
Parent: [[Analog In-Memory Computing]]
The fundamental architectural constraint of conventional computing: memory and compute are physically separate, connected by a bus with finite bandwidth, and the bus becomes the limiting factor for any memory-heavy workload. Every byte of data the compute unit touches must travel across the gap. For most programs this is fine. For matrix multiplication on billion-parameter models, it is the entire problem.
The numbers are stark. On a modern GPU, a single floating-point multiply-accumulate costs on the order of a picojoule. Moving a single byte from HBM to the compute unit costs tens to hundreds of picojoules. The compute is essentially free; the data movement dominates both energy and latency. This is why LLM inference on a GPU is typically memory-bandwidth-bound, not compute-bound, and why GPU utilisation sits well below peak FLOPS.
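The arithmetic can be made concrete with a back-of-envelope model of one fp16 matrix-vector multiply, the core operation of batch-1 LLM inference. The energy constants below are illustrative assumptions picked from within the ranges cited above (~1 pJ per MAC, tens of pJ per byte from HBM), and `matvec_energy_pj` is a hypothetical helper, not a measured benchmark:

```python
# Back-of-envelope energy budget for one n x n fp16 matrix-vector
# multiply. A matvec reads every weight exactly once and performs one
# MAC per weight, so its arithmetic intensity is ~1 MAC per 2 bytes.
# All constants are assumptions for illustration, not measurements.

E_MAC_PJ = 1.0        # assumed energy per multiply-accumulate, pJ
E_BYTE_PJ = 50.0      # assumed energy to fetch one byte from HBM, pJ
BYTES_PER_WEIGHT = 2  # fp16

def matvec_energy_pj(n: int) -> tuple[float, float]:
    """Return (compute_pJ, data_movement_pJ) for an n x n fp16 matvec."""
    macs = n * n                              # one MAC per weight
    weight_bytes = n * n * BYTES_PER_WEIGHT   # every weight read once
    return macs * E_MAC_PJ, weight_bytes * E_BYTE_PJ

compute, movement = matvec_energy_pj(4096)
print(f"compute:  {compute / 1e6:.1f} uJ")
print(f"movement: {movement / 1e6:.1f} uJ")
print(f"ratio:    {movement / compute:.0f}x")
```

Under these assumed constants, data movement costs roughly two orders of magnitude more than the arithmetic it feeds, which is the bottleneck in one number: the matrix is too large to cache, so every token generated re-streams all weights across the bus.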
Three families of approaches try to mitigate the bottleneck. Near-memory computing places simple compute logic next to DRAM, shortening the walk. Processing-in-memory embeds compute inside the memory controller or the memory die itself. In-memory computing — the most radical — makes the memory array itself perform the computation, dissolving the gap entirely.
Analog in-memory computing is the sharpest version of the third approach. It is not an optimisation of the von Neumann architecture. It is an exit from it.
## Related
- [[Crossbar Arrays]]
- [[Analog In-Memory Computing]]
---
Tags: #hardware #architecture #kp