"Tensor Cores — The Hardware Unit That Makes LLMs Possible"
@nikolatesla | 2026-04-27 15:12:11
In 2017, NVIDIA introduced the Volta architecture and with it a new kind of processing unit: the Tensor Core. It was not a subtle engineering refinement. It was a deliberate redesign of what a computation unit should optimize for, based on the recognition that deep learning had become the dominant use case for datacenter GPU workloads. Understanding Tensor Cores requires understanding why standard CUDA cores, despite their parallelism, were leaving significant performance on the table.

## The Inefficiency of Standard Matrix Multiplication on CUDA Cores

Standard CUDA cores perform scalar multiply-accumulate operations: multiply two numbers, add the result to an accumulator. For matrix multiplication, this means computing one element of the output matrix at a time, in parallel across many cores.

The fundamental problem is instruction overhead. Every multiply-accumulate requires an instruction fetch, decode, and dispatch — even though the operation being performed (multiply and add) is always the same. For a matrix multiplication involving millions of arithmetic operations, instruction overhead becomes a significant fraction of total time.

Additionally, matrix multiplication has a specific mathematical structure — the dot product — that can be computed with specialized hardware in fewer clock cycles than a sequence of scalar operations.

## What a Tensor Core Does

A Tensor Core is a hardware unit that performs a 4×4 matrix multiply-accumulate (MMA) in a single operation:

```
D = A × B + C
```

where A, B, C, and D are 4×4 matrices. In a single clock cycle, one Tensor Core computes 64 multiply-accumulate operations (4×4×4). A CUDA core would require 64 separate instructions to achieve the same result.

The throughput improvement is substantial. An H100 SXM5 GPU delivers:

- **CUDA cores (FP32)**: ~67 TFLOPS
- **Tensor Cores (FP16)**: ~989 TFLOPS
- **Tensor Cores (BF16)**: ~989 TFLOPS
- **Tensor Cores (FP8)**: ~3,958 TFLOPS

The factor-of-~60 improvement from FP32 CUDA cores to FP8 Tensor Cores on the same chip is the story of modern AI hardware efficiency.

## Mixed Precision: Trading Precision for Speed

The ability to use lower-precision arithmetic (FP16, BF16, FP8) rather than full FP32 is as important as the Tensor Core architecture itself. The key insight is that neural networks do not require the precision that scientific computing historically demanded.

**FP32 (single precision)**: 1 sign bit, 8 exponent bits, 23 mantissa bits. The standard floating-point format for scientific computing.

**FP16 (half precision)**: 1 sign bit, 5 exponent bits, 10 mantissa bits. Half the storage, more limited dynamic range.

**BF16 (brain floating point)**: 1 sign bit, 8 exponent bits, 7 mantissa bits. Same range as FP32, less precision. Developed by Google Brain specifically for ML.

**FP8 (E4M3 and E5M2)**: 1 sign bit, 4 or 5 exponent bits, 3 or 2 mantissa bits. Extremely limited precision; requires careful scaling.

> ⚡ BF16's design choice — preserving FP32's 8-bit exponent rather than FP16's 5-bit exponent — was specifically motivated by the observation that neural network training produces activations with large dynamic range (requiring the wide exponent) but can tolerate low mantissa precision. It is hardware designed around the specific mathematical behavior of gradient descent.
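To see the range-versus-precision tradeoff concretely, the short sketch below prints the limits PyTorch reports for each format and shows a gradient-sized value that underflows to zero in FP16 but survives in BF16. It is a minimal illustration assuming only a standard PyTorch install; FP8 dtypes are omitted because their availability depends on the PyTorch version.

```python
import torch

# Compare the numeric limits of the formats discussed above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

# A gradient-sized value: below FP16's representable range, fine for BF16.
g = torch.tensor(1e-8)
print("FP16:", g.to(torch.float16).item())   # 0.0  -- underflows (5-bit exponent)
print("BF16:", g.to(torch.bfloat16).item())  # ~1e-8 -- 8-bit exponent preserves the magnitude
```

The eps column makes the flip side visible: BF16 keeps FP32's range but resolves only about two to three decimal digits, which is exactly the tradeoff the blockquote above describes.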
## Mixed Precision Training: The Practical Pattern

NVIDIA's Automatic Mixed Precision (AMP) training pattern, now standard across PyTorch and TensorFlow (a minimal sketch appears at the end of this post):

1. **Weights stored in FP32**: Master weights maintained at full precision for numerical stability during gradient accumulation
2. **Forward and backward passes in FP16/BF16**: Operations computed in reduced precision using Tensor Cores
3. **Gradient scaling**: Loss scaled up before the backward pass, scaled down before the optimizer step, to prevent FP16 underflow
4. **Optimizer step in FP32**: Weight updates applied to the FP32 master weights

This pattern captures approximately 50–80% of the theoretical Tensor Core speedup while maintaining training stability. Without gradient scaling, FP16 training frequently produces NaN gradients due to underflow in the small gradient values typical of later training stages.

## FP8 Training: The Frontier

H100 and H200 GPUs support FP8 Tensor Core operations, offering another ~4× throughput improvement over BF16. FP8 training is more challenging because the reduced dynamic range requires per-tensor or per-block scaling (not just a global loss scale).

Transformer Engine — NVIDIA's hardware-software co-design layer — handles FP8 scaling automatically for supported operations, enabling FP8 training without manual intervention in the application code. The scaling metadata overhead is non-trivial but manageable.

## The Tensor Core in Context: RTX 3090 to H200

| Chip | Architecture | FP16 Tensor TFLOPS | Memory | Memory BW |
|------|--------------|--------------------|--------|-----------|
| RTX 3090 | Ampere | 142 TFLOPS | 24 GB GDDR6X | 936 GB/s |
| A100 SXM4 | Ampere | 312 TFLOPS | 80 GB HBM2e | 2,000 GB/s |
| H100 SXM5 | Hopper | 989 TFLOPS | 80 GB HBM3 | 3,350 GB/s |
| H200 SXM | Hopper | 989 TFLOPS | 141 GB HBM3e | 4,800 GB/s |

Note that H200's primary improvement over H100 is memory capacity and bandwidth, not compute. This is telling — it suggests memory bandwidth is already the binding constraint for many inference workloads, not raw compute. The next chapter addresses exactly this.

## The Software Stack Above Tensor Cores

Tensor Cores are not directly programmable by most users. They are exposed through:

- **cuBLAS/cuDNN**: NVIDIA's optimized linear algebra and deep learning primitive libraries
- **CUTLASS**: Open-source template library for custom matrix operations
- **TensorRT**: Inference optimization framework
- **Transformer Engine**: Automatic FP8/BF16 mixed precision management

The efficiency of a model on Tensor Cores depends heavily on how well its operations map to the tile sizes that Tensor Cores natively process. Operations on very small matrices, irregular shapes, or non-contiguous memory access patterns may not effectively utilize Tensor Core hardware even when nominally using FP16/BF16 computation.

> ⚡ The practical lesson: model architecture choices affect hardware utilization significantly. Attention head dimensions that are multiples of 64, batch sizes that fill Tensor Core tiles, sequence lengths that enable efficient memory tiling — these are not arbitrary choices. They are hardware-aware design decisions that determine whether you get 30% or 80% utilization of your GPU's theoretical peak performance.

The Tensor Core is the reason a 70-billion parameter LLM can run inference in seconds rather than hours on current hardware. It is also the reason understanding hardware is not optional for anyone designing AI systems at scale.
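As a concrete illustration of the AMP pattern described above, here is a minimal PyTorch training-loop sketch. The toy model, shapes, and hyperparameters are placeholders chosen only to make the loop self-contained, and it assumes a CUDA-capable GPU. With BF16 instead of FP16, the GradScaler becomes unnecessary, since BF16's wide exponent avoids the gradient underflow that loss scaling exists to prevent.

```python
import torch
from torch import nn

# Placeholder model and optimizer, just to make the loop runnable.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # parameters stay in FP32
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # manages loss scaling for FP16

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")        # placeholder batch
    y = torch.randint(0, 10, (32,), device="cuda")  # placeholder labels

    optimizer.zero_grad(set_to_none=True)

    # Forward pass in FP16; the matmuls dispatch to Tensor Cores.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y)

    scaler.scale(loss).backward()  # scale the loss up to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscale gradients, then step the FP32 weights
    scaler.update()                # adjust the scale factor if overflow was detected
```

The four steps of the pattern map directly onto the lines above: FP32 master weights held by the model and optimizer, reduced-precision compute inside `autocast`, loss scaling via `scaler.scale`, and the FP32 optimizer step via `scaler.step`.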