NVIDIA Blackwell B200: The Architecture That Made H100 Look Like a Prototype
#nvidia
#blackwell
#ai-hardware
#gpu
#inference
@nikolatesla | 2026-05-08 10:33:29
In March 2024, NVIDIA announced Blackwell at GTC. The numbers were so large they seemed implausible. Two years later, the B200 and GB200 NVL72 have shipped to hyperscalers, and the performance claims are holding up. Here is what the architecture actually looks like and why it matters.

## The Headline Numbers

At announcement, the B200 was the largest GPU ever built: 208 billion transistors across two 104-billion-transistor dies, connected by NVIDIA's proprietary NV-HBI (High-Bandwidth Interface) chip-to-chip interconnect with 10 TB/s of bandwidth between them.

| Spec | H100 SXM5 | B200 |
|------|-----------|------|
| Transistors | 80B | 208B (2×104B) |
| FP8 performance | 4 PetaFLOPS | 9 PetaFLOPS |
| FP4 performance | n/a | **20 PetaFLOPS** |
| HBM capacity | 80GB HBM3 | 192GB HBM3e |
| HBM bandwidth | 3.35 TB/s | 8 TB/s |
| TDP | 700W | 1000W |

The 20 PetaFLOPS FP4 number is the headline figure, but it needs context. FP4 (4-bit floating point) was not a standard format until Blackwell defined one. LLM inference at FP4 precision produces measurable quality degradation compared to FP16, but NVIDIA's second-generation Transformer Engine includes dynamic rescaling mechanisms that largely compensate: for LLMs in the 70B–405B parameter range, FP4 inference with Transformer Engine calibration shows less than 1% perplexity increase versus FP16 in NVIDIA's internal benchmarks.

## NVLink and the GB200 NVL72

The individual B200 is impressive. The real story is what happens at rack scale.

The GB200 Grace Blackwell Superchip pairs one NVIDIA Grace CPU (72-core Arm Neoverse V2) with two B200 GPUs on a single board. A GB200 NVL72 rack contains 36 GB200 Superchips (72 B200 GPUs in total), connected through NVLink Switch chips that give every GPU 1.8 TB/s of bandwidth into a single all-to-all fabric.

In practical terms: a single NVL72 rack achieves **1.4 ExaFLOPS of FP4 inference** performance. For context, the A100 cluster that trained GPT-4 reportedly comprised approximately 10,000 GPUs across multiple racks; ten NVL72 racks deliver equivalent inference throughput.

The NVLink bandwidth is the key differentiator. For model parallelism across many GPUs, interconnect bandwidth, not raw FLOPS, is frequently the bottleneck. The NVL72's 1.8 TB/s per GPU doubles the H100's 900 GB/s of NVLink bandwidth per GPU, roughly halving the communication overhead of tensor-parallel inference across 72 GPUs.

## Memory: Why 192GB Matters

The B200's 192GB of HBM3e per GPU (384GB per Superchip) changes the economics of large-model inference.

LLaMA 3 405B in FP16 requires approximately 810GB of GPU memory for the model weights alone. On H100 (80GB each), you need a minimum of 11 GPUs just to hold the model, with significant all-reduce communication overhead. On B200 (192GB each), you need only 5. A single GB200 NVL72 can therefore run multiple simultaneous instances of LLaMA 405B in full FP16 precision, or dozens of instances in FP4.

This is not just a performance story; it is a cost story. Dense-model inference has historically required expensive multi-node setups with InfiniBand networking. The GB200 NVL72's scale-up architecture (all GPUs on the same NVLink fabric) eliminates most inter-node communication for models up to approximately 1 trillion parameters.

## The Second-Generation Transformer Engine

The Transformer Engine was introduced with Hopper (H100) and handled FP8 mixed-precision training and inference. Blackwell's second-generation version adds:

1. **FP4 native execution**: dedicated FP4 tensor cores alongside FP8
2. **Microscaling formats (MX)**: block-level quantization with per-16-element scale factors, significantly improving accuracy over naive FP4 (see the sketch after this section)
3. **Dynamic tensor scaling**: real-time monitoring of activation ranges with per-layer rescaling, removing the need for post-training calibration in many scenarios

The MX format compatibility is significant because it aligns with the OCP (Open Compute Project) specification that AMD, Intel, Qualcomm, and Microsoft co-developed. This means FP4-quantized models trained on B200 infrastructure should be portable across future hardware from multiple vendors, at least in theory.
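To make the microscaling idea concrete, here is a minimal NumPy sketch of block quantization with one shared scale per 16-element block. It illustrates the concept, not NVIDIA's implementation: the element grid below is the standard FP4 (E2M1) value set, but a real MX format constrains the shared scale to a power of two (E8M0), which this sketch skips for brevity.

```python
import numpy as np

# Representable magnitudes of an FP4 (E2M1) element; sign is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mx_quantize(x: np.ndarray, block: int = 16):
    """Quantize a 1-D tensor to FP4 values with one shared scale per block."""
    assert x.size % block == 0, "pad the tensor to a multiple of the block size"
    blocks = x.reshape(-1, block)
    # One scale per block, chosen so the block's largest magnitude maps to 6.0
    # (the FP4 maximum). A naive scheme would use a single scale per tensor.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales = np.where(scales == 0, 1.0, scales)   # avoid division by zero
    scaled = blocks / scales
    # Round every scaled value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(scaled) * FP4_GRID[idx], scales

def mx_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

# Per-block scales confine an outlier's damage to its own 16 neighbours,
# which is why MX formats beat one-scale-per-tensor FP4 on real activations.
x = np.random.randn(4096).astype(np.float32)
x[0] = 50.0                                       # inject a single outlier
deq = mx_dequantize(*mx_quantize(x))
print("relative RMS error:", np.linalg.norm(deq - x) / np.linalg.norm(x))
```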
## Power and Thermal Implications

At 1000W TDP per B200 and up to 2700W for the full GB200 Superchip, a single GB200 NVL72 rack draws approximately 120kW. For comparison, a typical enterprise server rack averages 10–15kW.

This requires liquid cooling: direct liquid cooling (DLC) to the chip, not just to the rack door. NVIDIA's Blackwell deployment guide specifies coolant flow rates of 3–5 gallons per minute per rack at inlet temperatures below 45°C. Hyperscalers (Microsoft Azure ND GB200, AWS P6, Google Cloud A4) have had to retrofit or newly build liquid-cooled facilities to support these deployments.

This power density is a competitive moat. A startup or mid-tier cloud provider cannot simply "add Blackwell" to an air-cooled facility. The infrastructure requirements effectively limit dense Blackwell deployments to a handful of operators globally.

## What This Means for AI Development

The practical impact of Blackwell is twofold.

**For inference**: running frontier models (GPT-4 class, Claude Opus class) becomes dramatically cheaper per token. NVIDIA claims 30× better inference performance per dollar versus H100 at scale. Even discounting vendor optimism, 5–10× improvements are credible given the memory bandwidth and NVLink topology.

**For training**: the combination of FP4 training (enabled by the MX formats) and NVLink scale-up means the time-to-train for a 70B-parameter model drops from roughly 30 days on H100 clusters to approximately 5–7 days on a GB200 NVL72 cluster of comparable size, assuming a similar dataset and batch configuration.

The H100 remains the workhorse for most ML teams through 2026. But the architectural gap Blackwell represents (20 PetaFLOPS FP4, 192GB HBM3e, 1.8 TB/s NVLink) will define the ceiling of what is computationally possible in AI for the next three to four years.
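As a closing sanity check on that training claim, here is a hedged back-of-envelope in Python using the common ~6·N·D approximation for dense-transformer training FLOPs. The token count, cluster size, and sustained utilization (MFU) are illustrative assumptions, not measured numbers; only the peak-FLOPS figures come from the spec table above.

```python
# Back-of-envelope check of the "30 days -> 5-7 days" training claim,
# using the standard ~6 * params * tokens approximation for training FLOPs.
# Cluster size, token count, and MFU below are illustrative assumptions.

PARAMS = 70e9            # 70B-parameter model
TOKENS = 15e12           # assumed Llama-3-scale training set
TRAIN_FLOPS = 6 * PARAMS * TOKENS        # ~6.3e24 FLOPs total

def days_to_train(num_gpus: int, peak_flops: float, mfu: float = 0.40) -> float:
    """Wall-clock days at an assumed sustained fraction (MFU) of peak FLOPS."""
    return TRAIN_FLOPS / (num_gpus * peak_flops * mfu) / 86_400

GPUS = 1536              # assumed cluster size, held equal for both generations
print(f"H100 @ FP8 (4 PFLOPS peak):  {days_to_train(GPUS, 4e15):.0f} days")
print(f"B200 @ FP4 (20 PFLOPS peak): {days_to_train(GPUS, 20e15):.0f} days")
# -> roughly 30 days vs 6 days, consistent with the claim. Note that most of
#    the speedup comes from the FP4 peak figure, which presumes the MX-based
#    recipe holds model quality at 4-bit precision.
```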