"The Memory Wall — Why Bandwidth, Not Compute, Is Often the Real Bottleneck"
@nikolatesla
|
2026-04-27 15:12:12
|
A common mistake when thinking about AI hardware performance is to focus on FLOPS — floating-point operations per second. More FLOPS means faster AI, right? Not necessarily. For a significant class of AI workloads, particularly inference on large models, the actual bottleneck is not arithmetic throughput. It is memory bandwidth: how fast data can be moved from memory to the compute units that process it.

## The Arithmetic Intensity Problem

Arithmetic intensity is the ratio of compute operations to memory accesses, measured in FLOPs per byte. Every operation in a neural network has a characteristic arithmetic intensity:

- **Matrix multiplication (large matrices)**: High arithmetic intensity. Each element is reused many times, amortizing the cost of loading it.
- **Element-wise operations** (ReLU, layer norm, attention softmax): Low arithmetic intensity. Each memory load produces one computation.
- **Small-batch inference** (single-request LLM generation): Critically low arithmetic intensity.

When arithmetic intensity is below the hardware's "roof point" — the ratio of peak compute to memory bandwidth — the workload is memory-bandwidth-bound, not compute-bound. Adding more FLOPS doesn't help; you're waiting for data, not computation.

> ⚡ The Roofline Model: plot achievable performance against arithmetic intensity. The horizontal roof is peak compute (TFLOPS). The sloped line is memory bandwidth (GB/s × FLOPs/byte). Where your workload sits on that plot determines whether adding compute or bandwidth helps. For many LLM inference scenarios, workloads sit firmly on the slope — memory-bandwidth-bound.

## Why LLM Inference Is Memory-Bound

During autoregressive token generation (the standard "chat" mode for LLMs), the model generates one token at a time. For each token:

1. Load all model weights from HBM to on-chip SRAM
2. Perform attention over the KV cache (all previous context)
3. Run the feedforward computation
4. Sample the next token

For a 70B parameter model in FP16, the weights alone occupy 140 GB. The H100 SXM5 has 80 GB of HBM3 — a 70B model doesn't even fit on one GPU. On two H100s (160 GB total), generating each token requires loading ~140 GB of weights. At 3,350 GB/s per H100 (6,700 GB/s combined), loading 140 GB takes approximately 21 milliseconds.

The actual arithmetic for each token generation step, at batch size 1, is comparatively minimal: the H100s could perform the required matrix multiplications for one token in roughly a millisecond if memory were not the constraint. But moving the weights takes 20+ milliseconds. **Memory bandwidth, not compute, determines latency.**

## HBM vs GDDR6X: Why HBM Matters

High Bandwidth Memory (HBM) and GDDR6X (the memory used in consumer GPUs) differ in how the memory is connected to the compute die:

**GDDR6X**: Memory chips arranged around the GPU die on the PCB, connected via a relatively wide (384-bit) but long interface. The RTX 4090 with GDDR6X achieves roughly 1,000 GB/s.

**HBM3**: Memory stacks placed directly adjacent to the compute die, using a silicon interposer for very wide (1,024-bit per stack), short interconnects. The H100 SXM5 with HBM3 achieves 3,350 GB/s; even the previous-generation A100, with HBM2e, reaches ~2,000 GB/s.

The bandwidth difference (more than 3× between the H100 and a GDDR6X consumer card, and roughly 1.7× between the H100 and the A100) translates directly to throughput for memory-bound workloads. For compute-bound workloads (large-batch training), the difference matters less.

The tradeoff is cost and capacity. HBM is expensive to manufacture and limited in total capacity per stack. This is part of why H100 GPUs with 80 GB cost ~$30,000, while consumer RTX 4090s with 24 GB of GDDR6X cost ~$1,600.
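Putting the numbers from the last two sections together, here is a rough back-of-the-envelope sketch in plain Python. The 3,350 GB/s per GPU, two-H100 setup, and 140 GB of FP16 weights are the figures used above; the ~990 TFLOPS dense FP16 tensor throughput is an approximate published spec and is an assumption here, and real kernels reach only a fraction of it, so treat the output as an order-of-magnitude estimate rather than a benchmark.

```python
# Back-of-the-envelope roofline / per-token latency estimate.
# All hardware figures are approximate; real kernels achieve a fraction of peak.

def roof_point(peak_flops: float, bandwidth_bytes: float) -> float:
    """Arithmetic intensity (FLOPs/byte) below which a workload is memory-bound."""
    return peak_flops / bandwidth_bytes

def per_token_times(params: float, bytes_per_param: float,
                    peak_flops: float, bandwidth_bytes: float):
    """Estimate memory vs compute time for one decode step at batch size 1."""
    weight_bytes = params * bytes_per_param      # bytes streamed per token
    flops = 2 * params                           # ~2 FLOPs per weight (multiply-add)
    t_mem = weight_bytes / bandwidth_bytes       # time to stream the weights once
    t_compute = flops / peak_flops               # idealized compute time at peak
    return t_mem, t_compute

# Two H100 SXM5 GPUs, figures as in the text (assumed, approximate):
PEAK_FLOPS = 2 * 990e12       # ~990 dense FP16 TFLOPS each (assumption)
BANDWIDTH  = 2 * 3.35e12      # 3,350 GB/s of HBM3 each

t_mem, t_compute = per_token_times(params=70e9, bytes_per_param=2,
                                   peak_flops=PEAK_FLOPS, bandwidth_bytes=BANDWIDTH)

print(f"roof point: {roof_point(PEAK_FLOPS, BANDWIDTH):.0f} FLOPs/byte")
print(f"memory time per token:  {t_mem * 1e3:.1f} ms")    # ~21 ms
print(f"compute time per token: {t_compute * 1e3:.2f} ms")
# Decode at batch size 1 runs at ~1 FLOP per byte of weights moved -- far below
# the roof point of ~300 FLOPs/byte, so the step is memory-bandwidth-bound.
```

The compute time printed here is the idealized peak-throughput figure; at realistic utilization it lands closer to the ~1 millisecond quoted above, which does not change the conclusion.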
## The KV Cache: Trading Memory for Latency

The attention mechanism in transformers requires access to all previous tokens' keys and values (the KV cache) when generating each new token. For long conversations or long context windows, the KV cache grows large:

```
KV cache size = 2 × n_layers × n_heads × d_head × sequence_length × bytes_per_element
```

For a 70B-class model with 80 layers, 64 heads, and d_head = 128, in FP16:

- At 1K tokens: 2 × 80 × 64 × 128 × 1,024 × 2 bytes ≈ 2.7 GB
- At 32K tokens: ≈ 86 GB — exceeding a single H100's memory

The KV cache competes with the model weights for GPU memory. Techniques that attack this include:

- **Grouped Query Attention (GQA)**: Share KV heads across multiple query heads, reducing the KV cache by 4–8×
- **Multi-Query Attention (MQA)**: A single KV head for all query heads — the maximum cache reduction
- **PagedAttention (vLLM)**: Manage the KV cache in virtual pages, enabling better memory utilization across concurrent requests

These are all memory management techniques, not compute optimizations. They exist because memory bandwidth and capacity are the binding constraints.

## HBM Scaling: The Physical Limits

HBM bandwidth has increased with each generation; per GPU, the progression runs roughly from the V100's HBM2 (900 GB/s) through the A100's HBM2e (~2,000 GB/s) and the H100's HBM3 (3,350 GB/s) to the H200's HBM3e (4,800 GB/s). But there are physical limits:

- **Pin count**: Wider interfaces require more signal pins
- **Power**: Higher bandwidth requires more I/O power
- **Thermal**: Heat removal from dense HBM stacks becomes increasingly challenging
- **Die area**: More HBM stacks mean a larger package and higher cost

The H200's primary improvement over the H100 is HBM3e: 141 GB at 4,800 GB/s versus 80 GB at 3,350 GB/s. NVIDIA chose to improve memory before compute in that generational step, which is a direct acknowledgment that memory is the binding constraint for current use cases.

## In-Memory Computing: The Next Frontier

The fundamental issue is that memory and compute are separated. Every operation requires: read data from HBM → move it to SRAM → compute → potentially write the result back. The energy cost of data movement often exceeds the energy cost of the computation itself.

**Processing-in-Memory (PIM)** and **Compute-in-Memory (CIM)** approaches attempt to perform computation closer to, or inside, the memory itself. Samsung's HBM-PIM places simple processing elements inside HBM stacks. Axonn, Mythic, and other startups build analog or mixed-signal computing arrays where computation happens at the storage level. These approaches are still largely pre-commercial for large-scale AI workloads, but they represent the architectural direction that could break through the memory wall. The chapter on future silicon will return to this.

> ⚡ The memory wall is not a temporary engineering challenge. It reflects a fundamental physics reality: computation is energetically cheap; data movement is not. Fetching a bit from off-chip memory costs orders of magnitude more energy than performing an arithmetic operation on it. Any computing architecture that ignores this relationship will be dominated by it.

Understanding the memory wall means understanding why serving a 405B parameter LLM costs more per token than serving a 7B model — not primarily because of compute, but because of the bandwidth and memory capacity required to hold and load the larger model. Hardware constraints shape what AI products are economically viable to build and deploy.
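As a closing illustration, the sketch below plugs the KV-cache formula and the weight-streaming argument from this chapter into a few lines of Python. The 80-layer, 64-head, d_head = 128 configuration is the one used above; the 8-KV-head GQA variant and the 7B/405B FP16 comparison are illustrative assumptions, and the bandwidth figure is the H100's 3,350 GB/s.

```python
# KV-cache size (formula from the text) and per-token weight traffic.
# Model shapes, the GQA variant, and the 7B/405B comparison are illustrative assumptions.

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    """2x for keys and values; FP16 (2 bytes per element) by default."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

GB = 1e9

# 70B-class configuration used above: 80 layers, 64 heads, d_head = 128.
for seq in (1_024, 32_768):
    full = kv_cache_bytes(80, 64, 128, seq) / GB   # full multi-head attention
    gqa  = kv_cache_bytes(80, 8, 128, seq) / GB    # GQA with 8 KV heads (assumed)
    print(f"{seq:>6} tokens: MHA {full:6.1f} GB | GQA {gqa:5.1f} GB")
# ~2.7 GB at 1K tokens and ~86 GB at 32K; 8 KV heads cut both by 8x.

# Per-token weight traffic: every decode step streams the weights once.
BANDWIDTH = 3.35e12                                # one H100's HBM3, bytes/s
for params in (7e9, 405e9):
    t_ms = (params * 2) / BANDWIDTH * 1e3          # FP16 weight bytes / bandwidth
    print(f"{params/1e9:>5.0f}B model: ~{t_ms:5.1f} ms of memory time per token")
# The 405B model is sharded across GPUs in practice, which divides the time
# but not the bytes moved: ~58x more weight traffic per token than the 7B model.
```

The per-token ratio at the end is the point of the chapter in miniature: the larger model's per-token cost is dominated by the bytes it must move, not the arithmetic it must perform.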