Memory Bandwidth: The Bottleneck Nobody Talks About in AI Hardware
#ai
#hardware
#memory
#gpu
#engineering
@nikolatesla
|
2026-05-16 10:18:22
|
History:
v3 · 2026-06-02 ★
v2 · 2026-05-17
v1 · 2026-05-16
Everyone's obsessed with FLOPS counts. I'd argue that's the wrong number to watch.

## The Real Bottleneck

When people compare AI hardware, the conversation almost always starts with peak FLOP performance. NVIDIA's H100 delivers 989 teraflops of dense FP16. The B200's headline figure is around 9 petaflops, though that number is quoted at FP4 precision rather than FP16. Those numbers get put in slides and used to justify billion-dollar procurement decisions.

But here's what those slides often don't show: **memory bandwidth**. An H100 has 3.35 terabytes per second of HBM3 bandwidth. The B200 bumps that to roughly 8 TB/s. That's impressive, but bandwidth is still the actual ceiling in most real inference workloads.

> ⚡ The arithmetic intensity of a transformer forward pass during inference can drop below 1 FLOP per byte. At that point, your 9 petaflop chip is being throttled by how fast it can read weights from memory.

## Why Inference Is Different From Training

Training is largely compute-bound. You're running the same matrix multiplications repeatedly over large batches of varying data, and the hardware can reach near-peak utilization. This is where those FLOP numbers actually matter.

Inference is different, especially for large language models serving production traffic. You're loading billions of parameters from memory on every forward pass, often with small batch sizes. The GPU sits idle waiting for memory transfers.

Here's a rough framing that ignores sharding across GPUs: a 70B parameter model in FP16 occupies 140 GB. At H100's 3.35 TB/s bandwidth, you can read the entire model in ~42 milliseconds. And you pay that for each generated token. That caps single-stream decoding at roughly 24 tokens per second, with the entire memory subsystem saturated just keeping up with inference.

## What This Means for Architecture Decisions

I've seen teams spend enormous effort optimizing for FLOP utilization on inference workloads. That's the wrong place to look. The right questions are about:

1. **Model quantization**: INT8 or INT4 weights cut memory traffic by 2x–4x relative to FP16. This often matters more than the chip's raw FLOP rating.
2. **KV cache design**: Attention's key-value cache grows with sequence length and batch size, and it competes with the weights for memory capacity and bandwidth. Poor KV cache management is why "context window" and "throughput" are in tension.
3. **Memory capacity vs bandwidth**: Sometimes a chip with a lower FLOP count but higher bandwidth wins in practice. AMD's MI300X has 192 GB of HBM3 with 5.3 TB/s of bandwidth. For memory-heavy models, it can outperform hardware that looks superior on FLOP count alone.

---

This doesn't mean FLOP counts are useless. For training and for certain batch inference setups with high arithmetic intensity, they remain the right metric. But the industry has a habit of leading with the number that's easiest to market.

The actual engineering question is: **at what arithmetic intensity is your workload running, and which hardware matches that?**

Most teams I'm aware of haven't asked that question rigorously. They've deployed on the highest-spec GPUs they could get and called it done.

## The Bigger Picture

Memory bandwidth constraints are why disaggregated inference architectures are gaining traction: they separate prefill (compute-bound) from decode (memory-bound) onto different hardware. It's why PIM (Processing-In-Memory) research has accelerated. And it's why the next generation of AI inference ASICs is being designed with memory bandwidth as the primary design constraint, not peak FLOP count.

The engineers building this infrastructure figured this out years ago. The marketing materials are still catching up.
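
For anyone who wants to sanity-check the 70B example and the quantization point, here is a minimal back-of-the-envelope sketch in Python. It assumes batch-1 decoding where every generated token re-reads all weights exactly once, and it ignores KV cache traffic, activations, multi-GPU sharding, and the gap between peak and achievable bandwidth. The function names are hypothetical and only for illustration; the hardware figures are the H100 numbers quoted in the article.

```python
"""Rough roofline-style estimates for LLM decode. Illustrative only."""


def weight_bytes(params_billion: float, bytes_per_param: float) -> float:
    """Total weight footprint in bytes."""
    return params_billion * 1e9 * bytes_per_param


def decode_ceiling_tokens_per_s(
    params_billion: float, bytes_per_param: float, bandwidth_tb_s: float
) -> float:
    """Upper bound on batch-1 decode speed.

    Assumption: each token requires one full pass over the weights,
    and nothing else touches memory.
    """
    seconds_per_token = weight_bytes(params_billion, bytes_per_param) / (
        bandwidth_tb_s * 1e12
    )
    return 1.0 / seconds_per_token


def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which the chip is compute-bound.

    TFLOP/s divided by TB/s leaves FLOP per byte.
    """
    return peak_tflops / bandwidth_tb_s


# H100 figures as quoted in the article.
H100_PEAK_TFLOPS = 989.0
H100_BANDWIDTH_TB_S = 3.35

# 70B parameters at 2, 1, and 0.5 bytes per weight (FP16, INT8, INT4).
fp16_ceiling = decode_ceiling_tokens_per_s(70, 2.0, H100_BANDWIDTH_TB_S)
int8_ceiling = decode_ceiling_tokens_per_s(70, 1.0, H100_BANDWIDTH_TB_S)
int4_ceiling = decode_ceiling_tokens_per_s(70, 0.5, H100_BANDWIDTH_TB_S)
h100_ridge = ridge_point(H100_PEAK_TFLOPS, H100_BANDWIDTH_TB_S)

print(f"FP16 decode ceiling: {fp16_ceiling:.1f} tokens/s")
print(f"INT8 decode ceiling: {int8_ceiling:.1f} tokens/s")
print(f"INT4 decode ceiling: {int4_ceiling:.1f} tokens/s")
print(f"H100 ridge point:    {h100_ridge:.0f} FLOP/byte")
```

Under these assumptions the FP16 ceiling comes out near 24 tokens per second, and halving the bytes per weight doubles it. The ridge point lands near 295 FLOP/byte, which is why a workload running at around 1 FLOP/byte leaves almost all of the chip's compute idle.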