How AI Chips Work: From Sand to Intelligence
Structure

- why-gpus-not-cpus • "Why GPUs, Not CPUs, Run the AI Revolution"
- tensor-cores-and-mixed-precision • "Tensor Cores — The Hardware Unit That Makes LLMs Possible"
- memory-bandwidth-bottleneck • "The Memory Wall — Why Bandwidth, Not Compute, Is Often the Real Bottleneck"
- inference-vs-training-silicon • "Training vs Inference — Why They Need Different Hardware"
- future-of-ai-silicon • "What Comes After the GPU — Photonic Chips, Neuromorphic Computing, and the Next Decade"
"Why GPUs, Not CPUs, Run the AI Revolution"
@nikolatesla | 2026-04-27 15:12:12
The CPU — central processing unit — is the general-purpose workhorse of computing. It handles your operating system, your browser, your database queries, your business logic. It is extraordinarily versatile. For AI training and inference, it is largely the wrong tool. Understanding why requires understanding what AI computation actually is, at the mathematical level.

## What AI Actually Computes

At the core of every neural network operation — every layer of a transformer, every convolutional filter, every embedding lookup — is matrix multiplication. Specifically, the dominant operation in transformer-based LLMs is:

```
Y = X · W
```

where X is an input activation matrix, W is a weight matrix, and Y is the output activation matrix. For a single transformer layer processing a batch of sequences, this might involve multiplying an activation tensor of shape [batch_size × sequence_length × d_model] by a weight matrix of shape [d_model × d_ff].

The total number of multiply-accumulate (MAC) operations for a single forward pass through a large LLM can be in the range of 10¹⁵ to 10¹⁶ (the petaFLOP range). Training roughly triples this, since the backward pass needed for gradient computation and weight updates costs about twice as much as the forward pass. The arithmetic is relentless and regular.

## SISD vs SIMD: The Architectural Divide

CPUs are designed for SISD — Single Instruction, Single Data. A CPU core fetches an instruction, executes it on one or a few data items, then fetches the next instruction. (Modern cores add SIMD vector extensions such as AVX, but their width is modest by GPU standards.) This design is optimal for programs with complex branching logic, variable instruction sequences, and operations on heterogeneous data types. An operating system scheduler, a database query planner, a web browser rendering engine — these are CPU tasks.

GPUs are designed for SIMT — Single Instruction, Multiple Threads, a variant of SIMD (Single Instruction, Multiple Data). Thousands of simple processing cores execute the same instruction simultaneously on different data. This is catastrophically inefficient for branchy, sequential code. For matrix multiplication — where identical arithmetic is applied to thousands of data elements at once — it is spectacularly efficient.

> ⚡ A modern CPU has perhaps 16–64 cores, each capable of complex sequential operations. A modern GPU has thousands of CUDA cores — NVIDIA's H100 has 16,896 — each far simpler but operating in parallel. The H100 delivers roughly 1,000 TFLOPS of dense FP16 Tensor Core throughput (nearly 4,000 TFLOPS at FP8 with sparsity). A high-end CPU delivers perhaps 5–10 TFLOPS. For matrix math, the gap is two orders of magnitude or more.

## Matrix Multiplication as the Core Operation

Why does matrix multiplication dominate neural network computation? Because a neural network weight matrix is a linear transformation, and matrix multiplication is exactly how a linear transformation is applied to a batch of inputs. Every dense layer, every attention projection, every feedforward sublayer in a transformer is fundamentally matrix multiplication.

The regularity of this operation is precisely what GPUs exploit. Because the operation is identical across all elements, there is no branch divergence — every processing unit executes the same instruction at the same time — and the hardware can be built to execute this one operation with maximum efficiency. Convolutions — the core operation in convolutional neural networks for images — can also be expressed as matrix multiplications (via the im2col transformation), which is why the same GPU architecture serves both vision and language models.
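To make the shapes and counts above concrete, here is a minimal PyTorch sketch. The dimensions are deliberately small and illustrative rather than taken from any real model, and `F.unfold` stands in as an off-the-shelf im2col. It runs the Y = X · W projection, counts its MACs, and then reproduces a small convolution as a matrix multiplication over patch columns.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; real LLM layers are far larger.
batch_size, seq_len, d_model, d_ff = 2, 512, 1024, 4096

X = torch.randn(batch_size, seq_len, d_model)   # input activations
W = torch.randn(d_model, d_ff)                  # weight matrix of one projection

Y = X @ W                                       # the dominant operation: Y = X · W
print(Y.shape)                                  # torch.Size([2, 512, 4096])

# Multiply-accumulate count for this single projection:
macs = batch_size * seq_len * d_model * d_ff
print(f"{macs:.2e} MACs, ~{2 * macs:.2e} FLOPs")   # ~4.3e9 MACs

# Convolution is the same story: im2col (here via F.unfold) turns it into a
# matrix multiplication over columns of image patches.
x = torch.randn(1, 3, 32, 32)                   # one 3-channel 32x32 image
w = torch.randn(16, 3, 3, 3)                    # 16 filters, 3x3 kernel

cols = F.unfold(x, kernel_size=3)               # (1, 3*3*3, 30*30) patch matrix
out = (w.view(16, -1) @ cols).view(1, 16, 30, 30)

print(torch.allclose(out, F.conv2d(x, w), atol=1e-4))   # True: matches a real conv
```

Production libraries implement convolution with much faster variants of this mapping (implicit GEMM, Winograd, FFT), but the reduction to matrix multiplication is the same idea.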
## The Parallelism Hierarchy

GPU parallelism is organized hierarchically:

- **Thread**: The most basic unit. Executes one sequence of instructions.
- **Warp**: 32 threads that execute in lockstep. If threads in a warp diverge onto different code paths, the warp must serialize the execution of each path — the source of "branch divergence" penalties.
- **Block (Cooperative Thread Array)**: A group of warps that shares fast on-chip shared memory (SRAM). Communication within a block is cheap; cross-block communication goes through global memory (HBM), which is orders of magnitude slower.
- **Grid**: The entire collection of blocks for a kernel launch.

The practical implication: algorithms that map naturally onto this hierarchy — data organized in tiles that fit in shared memory, uniform operations within each tile — run efficiently on GPUs. Algorithms with irregular access patterns, data-dependent control flow, or small batch sizes can run surprisingly poorly despite the raw FLOP advantage.

## Why CPUs Still Matter

CPUs are not irrelevant to AI workloads. They handle:

- **Preprocessing**: Data loading, tokenization, augmentation — tasks with complex logic.
- **Orchestration**: Managing GPU memory, scheduling batches, handling errors.
- **Inference serving infrastructure**: Request queuing, batching decisions, network handling.
- **Small models and single-sample inference**: At very small batch sizes, CPU inference can be competitive.

The typical AI training setup is a tight loop: the CPU loads and preprocesses data, the GPU processes batches, and the CPU handles logging and checkpointing. The bottleneck shifts between CPU and GPU depending on workload balance.

## The CUDA Ecosystem: Why NVIDIA Won

NVIDIA's technical advantage in AI acceleration is real — the H100's Transformer Engine and NVLink interconnect are genuine engineering achievements. But the competitive moat that has made NVIDIA dominant in AI is largely software: CUDA.

CUDA (Compute Unified Device Architecture), introduced in 2006, gave developers a C-like language for writing GPU programs. Over nearly two decades, an ecosystem of libraries (cuBLAS, cuDNN, CUTLASS), frameworks (PyTorch, TensorFlow), and optimization tools has been built on CUDA. This ecosystem represents accumulated engineering investment that no competitor has matched.

> ⚡ AMD's ROCm platform is technically capable of running most AI workloads, but the engineering effort required to reach CUDA-equivalent performance in production deployments remains significant. The software ecosystem advantage is, in many respects, more durable than any hardware advantage.

Intel's accelerators (Gaudi 3, the Ponte Vecchio GPU), Google's TPUs, and various startups compete with different architectures. But the default assumption for training large models in 2026 is NVIDIA GPU + CUDA. The hardware and software moats are both real.

The next chapter examines what NVIDIA built specifically for AI — the Tensor Core — and why it represents a fundamental rethinking of what a computing unit should be optimized for.
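A hands-on footnote to close this chapter: the CPU-to-GPU gap quoted above is easy to observe directly. The sketch below assumes only that PyTorch is installed; the matrix size, iteration count, and the CPU-FP32-versus-GPU-FP16 pairing are illustrative choices, not figures from this article. It times the same square matrix multiplication on the CPU and, if one is visible, on a CUDA device, and reports rough achieved TFLOPS.

```python
import time
import torch

def time_matmul(device: str, dtype: torch.dtype, n: int = 4096, iters: int = 10) -> None:
    """Time an n x n matrix multiplication and report rough achieved TFLOPS."""
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    for _ in range(3):                      # warm-up: exclude one-time setup costs
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()            # GPU kernels launch asynchronously
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    tflops = 2 * n**3 / elapsed / 1e12      # a square matmul costs ~2*n^3 FLOPs
    print(f"{device:>4} {str(dtype):>13}: {elapsed * 1e3:8.1f} ms  ~{tflops:6.1f} TFLOPS")

time_matmul("cpu", torch.float32)           # CPUs handle general-purpose FP32 math
if torch.cuda.is_available():
    time_matmul("cuda", torch.float16)      # FP16 engages the GPU's Tensor Cores
```

Absolute numbers depend on the BLAS backend, clock speeds, and thermals; the point of the exercise is the ratio between the two lines.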