# NVIDIA Blackwell Architecture: What Actually Changed from Hopper and Why It Matters

208 billion transistors. That's what NVIDIA packed into the Blackwell B200 GPU, about 2.6 times Hopper's 80 billion.

The numbers are staggering. But transistor count alone doesn't explain why Blackwell represents a different category of compute, not just an incremental update.

## The Problem Hopper Left Unsolved

Hopper (H100) was remarkable for training large language models. The SXM version delivered 3.35 TB/s of memory bandwidth, and its Transformer Engine brought FP8 precision to transformer workloads. At its launch in 2022, it was the fastest AI accelerator NVIDIA had shipped.

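But as inference demands exploded (running finished models at scale, not just training them) the bottleneck shifted. LLM inference, especially token-by-token decoding, is usually bound by memory bandwidth and memory capacity rather than raw compute. An H100 with 80GB of HBM3 hits its ceiling fast when you're trying to serve a 70B parameter model to thousands of users simultaneously: at FP16 the weights alone take about 140GB, so the model has to be split across GPUs or quantized before the KV cache for each user is even counted.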
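
The capacity side of that argument is simple arithmetic. The sketch below estimates weight memory only, under the assumption of a dense model with a fixed number of bytes per parameter; the function names are illustrative, and KV cache, activations, and runtime overhead would all add to these totals.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), ignoring KV cache and overhead."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(params_billion: float, precision: str, hbm_gb: float) -> int:
    """Smallest GPU count whose combined HBM can hold the weights alone."""
    return math.ceil(weight_gb(params_billion, precision) / hbm_gb)


for precision in BYTES_PER_PARAM:
    size = weight_gb(70, precision)
    gpus = min_gpus_for_weights(70, precision, hbm_gb=80)
    print(f"70B at {precision}: {size:.0f} GB of weights, at least {gpus} x 80 GB GPU(s)")
```

Real deployments need headroom well beyond this lower bound, since the KV cache grows with both context length and the number of concurrent users.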

---

## What Blackwell Actually Changed

**The dual-die design** is the key architectural decision. Blackwell uses two dies connected by a 10TB/s die-to-die link that NVIDIA calls NV-HBI (High-Bandwidth Interface). To software it appears as a single GPU, but the package carries roughly twice the silicon that a single reticle-limited die could provide.

The result: 192GB of HBM3e memory, 8TB/s of bandwidth, and up to 20 petaflops of FP4 AI performance per GPU, a peak figure NVIDIA quotes with sparsity.
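
To see why the bandwidth number matters for inference, a common back-of-envelope model helps. It assumes that at batch size 1 every generated token requires streaming all model weights from HBM once, so memory bandwidth divided by weight size gives an upper bound on decode speed. That model is an assumption for illustration, not a figure from NVIDIA, and the function name is hypothetical.

```python
def decode_ceiling_tokens_per_s(bandwidth_tb_s: float,
                                params_billion: float,
                                bytes_per_param: float) -> float:
    """Upper bound on single-sequence decode speed when weight reads dominate.

    Assumes every weight is read from HBM once per generated token and
    ignores KV-cache traffic, compute time, and multi-GPU communication.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


# Bandwidth figures are the ones quoted in this article.
h100_fp8 = decode_ceiling_tokens_per_s(3.35, 70, 1.0)
b200_fp8 = decode_ceiling_tokens_per_s(8.0, 70, 1.0)
b200_fp4 = decode_ceiling_tokens_per_s(8.0, 70, 0.5)

print(f"H100, 70B at FP8: {h100_fp8:.0f} tokens/s ceiling")
print(f"B200, 70B at FP8: {b200_fp8:.0f} tokens/s ceiling")
print(f"B200, 70B at FP4: {b200_fp4:.0f} tokens/s ceiling")
```

Batching raises aggregate throughput well above these per-sequence ceilings, but the ratio between the two GPUs at the same precision tracks the bandwidth ratio, about 2.4x.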

Equally significant is the **fifth-generation NVLink** fabric. Each Blackwell GPU gets 1.8TB/s of bidirectional NVLink bandwidth, and NVLink Switch systems can extend a single NVLink domain to as many as 576 GPUs. Training a trillion-parameter model becomes a different kind of engineering problem when the interconnect is fast enough to ease the usual inter-node bottlenecks.

> ⚡ A rack of 72 B200 GPUs delivers about 1.4 exaflops of peak FP4 AI performance. Matching that on paper with H100s, which peak at roughly 4 petaflops of sparse FP8 each, would take about 360 GPUs.
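
The rack figure is the per-GPU peak multiplied out, and the H100 comparison depends heavily on which per-GPU peak is assumed for the H100. The sketch below keeps that value as a parameter; the 4 petaflops used in the example is an assumption based on the H100's published sparse FP8 peak, and the function names are illustrative.

```python
import math


def rack_petaflops(gpus: int, petaflops_per_gpu: float) -> float:
    """Aggregate peak for a rack, assuming perfect scaling across GPUs."""
    return gpus * petaflops_per_gpu


def equivalent_gpus(target_petaflops: float, petaflops_per_gpu: float) -> int:
    """GPUs of a given per-GPU peak needed to reach a target peak on paper."""
    return math.ceil(target_petaflops / petaflops_per_gpu)


rack = rack_petaflops(72, 20)  # 72 GPUs at 20 petaflops each
print(f"Rack peak: {rack / 1000:.2f} exaflops")

# Assumed H100 peak: about 4 petaflops (sparse FP8). Change it to see the sensitivity.
print(f"H100s needed at 4 PF each: {equivalent_gpus(rack, 4.0)}")
```

These are peak numbers at different precisions, so the result is a paper comparison rather than a measured one.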

---

## The FP4 Precision Addition

Hopper introduced FP8 training. Blackwell adds FP4 for inference. The precision reduction sounds alarming, but quantization research suggests that a well-tuned FP4 inference run on many LLMs can lose less than 1% accuracy relative to FP16, while roughly doubling peak throughput over FP8.
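
To make the precision reduction concrete, here is a minimal simulation of block-scaled FP4 rounding. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, with one scale factor per block of 32 values. The scale is kept at full precision here for simplicity, the weights are synthetic, and none of this is NVIDIA's actual implementation.

```python
import random

# Magnitudes representable in E2M1 FP4 (assumed format for this sketch).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_fp4(block):
    """Round one block to scaled FP4 values and return the dequantized floats."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0.0] * len(block)
    scale = peak / FP4_MAGNITUDES[-1]  # map the largest magnitude onto 6.0
    restored = []
    for x in block:
        # Pick the nearest representable magnitude in scaled units.
        mag = min(FP4_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        restored.append(mag * scale if x >= 0 else -mag * scale)
    return restored


def quantize_fp4(values, block_size=32):
    """Apply block-scaled FP4 rounding to a flat list of floats."""
    restored = []
    for start in range(0, len(values), block_size):
        restored.extend(quantize_block_fp4(values[start:start + block_size]))
    return restored


def relative_rms_error(original, approx):
    """Root-mean-square error of approx, relative to the RMS of original."""
    num = sum((a - b) ** 2 for a, b in zip(original, approx))
    den = sum(a * a for a in original)
    return (num / den) ** 0.5


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights
restored = quantize_fp4(weights)
print(f"relative RMS weight error: {relative_rms_error(weights, restored):.3f}")
```

Per-weight rounding error is not the same thing as end-task accuracy loss. Models tolerate a surprising amount of weight noise, which is why the accuracy figures quoted for FP4 are far smaller than the raw rounding error this prints.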

This isn't an academic exercise. Data centers running GPT-class models at scale will see real cost reductions from FP4 inference — lower power per token, higher throughput per rack.

---

## What This Means for AI Training Economics

The jump from H100 to B200 is roughly 5x in peak AI performance, comparing B200 at FP4 with H100 at FP8. Power draw rose too, but far less than proportionally: the B200 consumes up to 1000W, compared to H100's 700W, an increase of about 1.4x. Performance-per-watt therefore improves significantly.

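> ⚡ On the peak figures above, 5x the performance at about 1.4x the power works out to roughly 3.5x better performance-per-watt, comparing B200 at FP4 with H100 at FP8. At matched FP8 precision the gain is smaller.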
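
The power figures above make it easy to check how a raw performance ratio translates into performance-per-watt. A minimal sketch using only the 5x, 1000W, and 700W numbers from this section, with illustrative function names:

```python
def perf_per_watt_gain(perf_ratio: float, new_watts: float, old_watts: float) -> float:
    """Performance-per-watt improvement implied by a performance ratio and two power draws."""
    power_ratio = new_watts / old_watts
    return perf_ratio / power_ratio


def perf_ratio_needed(target_gain: float, new_watts: float, old_watts: float) -> float:
    """Raw performance ratio required to hit a target performance-per-watt gain."""
    return target_gain * (new_watts / old_watts)


gain = perf_per_watt_gain(5.0, new_watts=1000.0, old_watts=700.0)
print(f"5x performance at 1000W vs 700W: {gain:.2f}x performance-per-watt")
```

`perf_ratio_needed` runs the same arithmetic in reverse, which is useful for checking a quoted performance-per-watt figure against the power numbers. Both assume the GPUs run at their maximum rated power, which real workloads often do not.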

For hyperscalers running hundreds of thousands of GPUs, even small efficiency gains matter enormously. At a reported $30,000 to $40,000 per GPU, the economics of a Blackwell deployment depend as much on power and cooling infrastructure as on chip price.

---

## The Bigger Picture

Blackwell isn't primarily about training the next GPT-class model. It's about making inference economically viable at internet scale.

The current constraint on deploying large models broadly isn't capability — it's cost per token. Every architectural decision in Blackwell (FP4, HBM3e, NVLink fabric, dual-die design) is aimed at reducing that number.

The compute trajectory is clear: inference, at scale, efficiently. Blackwell is the current answer to that specific engineering problem.
