# von Neumann Bottleneck

Parent: [[Analog In-Memory Computing]]

The fundamental architectural constraint of conventional computing: memory and compute are physically separate, connected by a bus with finite bandwidth, and the bus becomes the limiting factor for any memory-heavy workload. Every byte of data the compute unit touches must travel across the gap. For most programs this is fine. For matrix multiplication on billion-parameter models, it is the entire problem.

The numbers are stark. On a modern GPU, a single floating-point multiply-accumulate takes on the order of a picojoule. Moving a single byte from HBM to the compute unit takes tens to hundreds of picojoules. The compute is essentially free; the data movement dominates both energy and latency. This is why GPU utilisation during LLM inference is typically memory-bandwidth-bound, not compute-bound.

Three families of approaches try to mitigate the bottleneck. Near-memory computing places simple compute logic next to DRAM, shortening the walk. Processing-in-memory embeds compute inside the memory controller or the memory die itself. In-memory computing — the most radical — makes the memory array itself perform the computation, dissolving the gap entirely.

Analog in-memory computing is the sharpest version of the third approach. It is not an optimisation of the von Neumann architecture. It is an exit from it.

## Related

- [[Crossbar Arrays]]
- [[Analog In-Memory Computing]]

---

Tags: #hardware #architecture #kp
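The memory-bound claim above can be checked with a quick roofline-style estimate. This is a back-of-envelope sketch, not a measurement: the model size, bytes per weight, peak throughput, and HBM bandwidth below are all illustrative assumptions roughly in the range of a 7B-parameter fp16 model on a current datacentre GPU.

```python
# Roofline back-of-envelope for one LLM decode step (batch size 1).
# All figures below are illustrative assumptions, not measurements.
params = 7e9                 # assumed 7B-parameter model
bytes_per_param = 2          # fp16 weights
flops = 2 * params           # one multiply-accumulate per weight = 2 FLOPs
bytes_moved = params * bytes_per_param  # every weight streamed from HBM

peak_flops = 1e15            # assumed peak compute throughput (FLOP/s)
peak_bw = 3e12               # assumed HBM bandwidth (bytes/s)

compute_time = flops / peak_flops    # time if compute-bound
memory_time = bytes_moved / peak_bw  # time if bandwidth-bound

intensity = flops / bytes_moved      # arithmetic intensity (FLOP/byte)
ridge = peak_flops / peak_bw         # roofline ridge point (FLOP/byte)

print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"ridge point:          {ridge:.1f} FLOP/byte")
print(f"compute-bound time:   {compute_time * 1e3:.3f} ms/token")
print(f"memory-bound time:    {memory_time * 1e3:.3f} ms/token")
```

With these numbers the arithmetic intensity is about 1 FLOP/byte while the ridge point is in the hundreds, so the memory-bound time dominates the compute-bound time by orders of magnitude: the bus, not the ALUs, sets the token rate.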