"Training vs Inference — Why They Need Different Hardware"
@nikolatesla
|
2026-04-27 15:12:12
AI computation has two fundamental phases: training, in which a model learns from data by repeatedly adjusting billions of parameters; and inference, in which a trained model generates predictions or responses. These phases have radically different computational profiles, and a hardware architecture optimized for one performs suboptimally for the other. This divergence has driven the creation of an entire ecosystem of specialized silicon — and explains why the AI chip market looks like more than just "bigger GPUs."

## The Mathematics of Training

Training a large language model requires:

1. **Forward pass**: Compute model outputs from inputs (same as inference)
2. **Loss computation**: Compare outputs to targets
3. **Backward pass**: Compute gradients through the entire network via backpropagation
4. **Weight update**: Apply the optimizer (Adam, AdamW) to update all parameters

The backward pass is approximately 2× as expensive as the forward pass. The optimizer state — for Adam, this includes first and second moment estimates for every parameter — adds another 2× memory overhead beyond the model weights themselves.

For a 70B parameter model in full BF16 training:

- Model weights: 140 GB
- Gradients: 140 GB
- Optimizer states (FP32): 560 GB
- Activations (batch-dependent): variable, potentially hundreds of GB

**Total: ~840 GB minimum.** A single H100 with 80 GB of HBM holds only a fraction of that. This is why large model training requires tensor parallelism, pipeline parallelism, and data parallelism across hundreds or thousands of GPUs simultaneously.

## Training Optimization: Maximize Throughput

The key metric for training is **throughput**: tokens per second, samples per second, FLOPS utilization. A single training run of a frontier model might require 10²³ to 10²⁵ FLOPs total, taking weeks to months on large GPU clusters. The total cost scales with cluster size × training time × the cost per accelerator-hour (hardware plus power).

Maximizing throughput means:

- **Large batch sizes**: Amortize weight loading overhead across many samples. Matrix multiplications become larger, improving arithmetic intensity.
- **High memory bandwidth**: Move activations and gradients rapidly between memory and compute
- **NVLink/NVSwitch**: Fast GPU-to-GPU interconnect for tensor parallelism within a node
- **InfiniBand**: Fast network for pipeline and data parallelism across nodes

The H100 SXM5, with 3,350 GB/s HBM3 bandwidth and 900 GB/s NVLink, is optimized for exactly these requirements. Training is where raw FLOPS matter most, because large batch sizes raise arithmetic intensity above the memory-bandwidth bottleneck.

## Inference Optimization: Minimize Latency

Inference requirements diverge sharply from training:

- **Online inference** (chatbot, real-time API): Minimize latency — time to first token, tokens per second per request
- **Offline inference** (batch processing, embeddings): Maximize throughput
- **Edge inference** (mobile, embedded): Minimize power and model size

For online inference at small batch sizes (1–8 requests), arithmetic intensity is low — workloads are memory-bandwidth-bound as described in the previous chapter. The optimization priorities become:

- **Memory capacity**: Fit the full model on-chip
- **Memory bandwidth**: Move weights faster
- **Low latency interconnect**: Minimal overhead for inter-layer operations

> ⚡ A key insight: for inference, a chip with 2× the memory bandwidth but the same FLOPS as a training GPU will often deliver better performance for single-user latency.
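To make that concrete, here is a minimal back-of-the-envelope sketch in plain Python, using illustrative numbers already cited in this article (a 70B-parameter model in BF16 and the H100's ~3,350 GB/s HBM bandwidth). It estimates the bandwidth-bound latency floor for batch-1 decoding, under the simplifying assumption that every generated token must stream all weights from memory once.

```python
# Back-of-the-envelope estimate of the memory-bandwidth floor for batch-1
# autoregressive decoding. Assumption: each generated token streams all
# model weights from memory once (ignores KV cache, activations, and
# kernel overheads, which only make the real numbers worse).

def decode_latency_floor(params_billion: float,
                         bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Lower bound on seconds per generated token at batch size 1."""
    weight_gb = params_billion * bytes_per_param  # GB, since params are in billions
    return weight_gb / bandwidth_gb_s

# Numbers from the article: 70B params, BF16 (2 bytes/param), H100 SXM5 at ~3,350 GB/s.
latency = decode_latency_floor(70, 2, 3350)
print(f"H100, 70B BF16: >= {latency * 1000:.1f} ms/token, "
      f"<= {1 / latency:.0f} tokens/s per user")

# The blockquote's point: same FLOPS, twice the bandwidth, half the floor.
latency_2x = decode_latency_floor(70, 2, 2 * 3350)
print(f"Hypothetical 2x-bandwidth chip: <= {1 / latency_2x:.0f} tokens/s per user")
```

At batch size 1 this bound sits far below what the chip's FLOPS could sustain, which is exactly why inference-oriented designs trade raw compute for memory bandwidth.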
The Hopper H100's improvements in memory bandwidth per dollar over the Ampere A100 directly translated to better inference throughput, independent of the raw FLOPS improvement.

## Why Google Built TPUs

Google's Tensor Processing Units (TPUs) represent a deliberately different design philosophy from GPUs: hardware purpose-built for neural network workloads from the ground up, rather than GPU compute adapted for deep learning.

**TPU v4** characteristics:

- **Matrix Multiply Unit (MXU)**: 256×256 systolic array — a hardware structure specifically optimized for matrix multiplication rather than general parallel compute
- **High Bandwidth Memory**: 1,200 GB/s per chip
- **Tight coupling with datacenter network**: Custom ICI (Inter-chip Interconnect) at 600 GB/s per chip, enabling very large pod configurations
- **No CUDA ecosystem**: TPUs use JAX/XLA as the programming model

The systolic array at the MXU core is architecturally different from CUDA/Tensor Cores: data flows through the array in a wave pattern, with each cell performing multiply-accumulate and passing results to neighbors. For large batch matrix multiplication, this is extremely efficient. For small batch inference or operations outside the systolic array, it is less so.

Google uses TPUs primarily for training and large-batch inference on their own models (Gemini). The proprietary advantage is significant: Google does not pay NVIDIA's margins, controls the hardware roadmap, and can co-optimize model architecture and hardware.

## Groq LPU: Optimized for Inference Latency

Groq's Language Processing Unit (LPU) takes a different approach: a compiler-determined, deterministic execution model designed to minimize the memory latency problem.

Key architectural choices:

- **On-chip SRAM only**: No HBM. All model weights must fit in on-chip SRAM. Extremely fast access (no HBM latency), but capacity-limited.
- **Deterministic execution**: The compiler schedules all operations at compile time. No runtime dispatch overhead, no cache misses, no speculation — all execution is predetermined.
- **Extremely high SRAM bandwidth**: ~80 TB/s aggregate on-chip bandwidth vs the H100's 3.35 TB/s HBM

For smaller models that fit in SRAM, Groq achieves dramatically lower latency than the H100 — often cited at 10× faster time-to-first-token on comparable model sizes. For large models that don't fit on-chip, multi-chip configurations are required, with significantly more complexity.

## Cerebras WSE: The Wafer-Scale Extreme

Cerebras's Wafer Scale Engine takes the opposite approach to chip integration: instead of connecting multiple chips, make the chip as large as physically possible. The WSE-3 is a single semiconductor die the size of a full wafer:

- **900,000 AI cores**
- **44 GB on-chip SRAM** (no HBM)
- **21 PB/s** on-chip memory bandwidth
- **125 PFLOPS** compute

The advantage for training: it eliminates all inter-chip communication bottlenecks. Gradient synchronization across 900K cores on a single die with 21 PB/s bandwidth is orders of magnitude faster than synchronization across multiple GPUs over NVLink or InfiniBand.

The limitation: 44 GB of SRAM can fit models up to roughly 20B parameters in FP16. For larger models, distributed configurations are still necessary.

## The Hardware Diversity Conclusion

The existence of TPUs, Groq LPUs, Cerebras WSEs, and multiple other specialized AI accelerators is not market fragmentation — it is hardware physics following application requirements.
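A quick capacity sketch makes the same point from the memory side. This is a rough estimate under stated assumptions (weights only, no KV cache or runtime buffers), using the per-device capacities quoted above (H100: 80 GB HBM; Cerebras WSE-3: 44 GB SRAM); it reproduces the ~20B-parameter FP16 ceiling for the WSE-3 and shows why SRAM-only designs like Groq's need multi-chip configurations for large models.

```python
# Rough capacity check: how many parameters fit in a device's on-package
# memory at a given precision? Weights only -- KV cache, activations, and
# runtime buffers are ignored, so real limits are lower.

def max_params_billion(memory_gb: float, bytes_per_param: float) -> float:
    """Upper bound on model size (billions of parameters) that fits on one device."""
    return memory_gb / bytes_per_param  # GB / (bytes per param) -> billions of params

devices = {
    "H100 (80 GB HBM)": 80,
    "Cerebras WSE-3 (44 GB SRAM)": 44,
}

for name, capacity_gb in devices.items():
    fp16 = max_params_billion(capacity_gb, 2)  # FP16/BF16: 2 bytes per param
    int8 = max_params_billion(capacity_gb, 1)  # INT8: 1 byte per param
    print(f"{name}: ~{fp16:.0f}B params at FP16, ~{int8:.0f}B at INT8")

# A 70B model needs ~140 GB for FP16 weights alone, so it fits on neither
# device above -- hence multi-chip parallelism, or larger-memory parts
# such as the H200 for single-node inference.
```

The table below summarizes how these capacity, bandwidth, and latency constraints map to hardware choices.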
| Use Case | Primary Constraint | Optimal Hardware |
|----------|--------------------|------------------|
| Large model training | Throughput + inter-chip BW | H100 cluster / TPU pod |
| Low-latency inference (small model) | Memory bandwidth + latency | Groq LPU / H100 |
| Low-latency inference (large model) | Memory capacity + BW | H200 cluster |
| Edge/mobile inference | Power + cost | Qualcomm NPU / Apple Neural Engine |
| Research (flexible) | Versatility | H100/A100 (CUDA ecosystem) |

The next chapter examines what happens beyond current silicon — what physical limits constrain conventional approaches and what the candidate technologies for the next decade look like.