"Speculative Decoding: The Inference Trick That Quietly Fixed LLM Latency"

Transformer inference is embarrassingly serial. Generate one token, wait, generate the next. The autoregressive loop was always the latency wall — and for three years, the industry accepted it as a fundamental constraint.

Speculative decoding breaks that constraint without changing the model.

## The Problem With Autoregressive Decoding

Standard token generation is sequential by design. Each token depends on all previous tokens, so you can't parallelize across output positions. Worse, every decode step has to stream the model's full weights from GPU memory to produce a single token, so the step is limited by memory bandwidth while the compute units sit mostly idle.

**The numbers are stark:** a 70B-parameter model on a single A100 (quantized to fit in memory) might generate 30–50 tokens/second. The matrix multiplications themselves could run orders of magnitude faster; the bottleneck is re-reading the weights once per token in a serial loop, not the arithmetic.
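A rough way to see where that ceiling comes from is a back-of-envelope bound: if every decode step must read all the weights from GPU memory once, tokens per second cannot exceed memory bandwidth divided by model size in bytes. The sketch below assumes roughly 2,000 GB/s of bandwidth (approximately an A100 80GB) and ignores KV-cache reads, kernel overhead, and batching. The function name and the figures are illustrative assumptions, not measurements.

```python
def max_decode_tokens_per_sec(n_params: float, bytes_per_param: float,
                              mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    model_bytes = n_params * bytes_per_param
    return mem_bandwidth_gb_s * 1e9 / model_bytes


# Assumed figures, for illustration only.
BANDWIDTH_GB_S = 2000.0  # roughly an A100 80GB
PARAMS_70B = 70e9

fp16_bound = max_decode_tokens_per_sec(PARAMS_70B, 2.0, BANDWIDTH_GB_S)
int4_bound = max_decode_tokens_per_sec(PARAMS_70B, 0.5, BANDWIDTH_GB_S)
print(f"fp16 weights:  at most ~{fp16_bound:.0f} tokens/s")
print(f"4-bit weights: at most ~{int4_bound:.0f} tokens/s")
```

Under these assumptions the 4-bit bound is in the same range as the 30–50 tokens/second quoted above. A 70B model at fp16 needs about 140 GB for weights alone, more than one 80 GB card holds, so a single-GPU figure implies quantized weights.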

---

## How Speculative Decoding Actually Works

A small "draft" model (7B or smaller) generates several candidate tokens one after another, which is cheap because the model is small. The large "verifier" model then scores all of them in a single forward pass, the same way it processes a prompt.

> ⚡ One forward pass of the large model can accept or reject K draft tokens simultaneously — turning K serial steps into 1 verification step.

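Draft tokens are checked left to right. Each one is accepted with probability min(1, p/q), where p and q are the probabilities the verifier and the draft model assign to that token. If all K are accepted, you get K tokens, plus one bonus token sampled from the verifier's final position, for the cost of one large-model forward pass. At the first rejection, the remaining draft tokens are discarded, the verifier supplies a replacement for that position (sampled from the renormalized positive part of p − q), and drafting restarts. This rule is what keeps the output distribution identical to sampling from the large model alone.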

The math works because verifying K tokens costs about the same as generating one: a decode pass is memory-bound, so scoring K positions in that pass adds little wall-clock time, whereas generating them with the large model would take K sequential passes.
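The draft-and-verify round can be sketched in a few lines. This is a toy illustration of the standard speculative sampling rule (accept a drafted token with probability min(1, p/q), where p is the verifier's probability and q the draft's; on rejection, resample from the renormalized positive part of p − q), not any particular library's implementation. The callables `draft_dist` and `target_dist` are hypothetical stand-ins for the two models, and the drafting loop here is sequential.

```python
import random


def sample(probs, rng):
    """Draw one token id from a categorical distribution given as a list."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(draft_dist, target_dist, prefix, k, rng):
    """One draft-then-verify round; returns the tokens emitted this round.

    draft_dist / target_dist are hypothetical callables mapping a token
    prefix (list of ints) to a next-token probability list.
    """
    # 1. Draft k tokens one at a time with the cheap model.
    drafted, q_dists = [], []
    for _ in range(k):
        q = draft_dist(prefix + drafted)
        q_dists.append(q)
        drafted.append(sample(q, rng))

    # 2. Score every drafted position with the target model. A real system
    #    gets all k + 1 distributions from ONE batched forward pass; this toy
    #    version calls the function once per position for clarity.
    p_dists = [target_dist(prefix + drafted[:i]) for i in range(k + 1)]

    # 3. Accept drafted tokens left to right with probability min(1, p/q).
    emitted = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # First rejection: resample from max(0, p - q) (rng.choices
        # normalizes the weights), then stop. Output stays distributed as p.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(sample(residual, rng))
        return emitted

    # 4. Every draft token was accepted: take one bonus token from the target.
    emitted.append(sample(p_dists[k], rng))
    return emitted


# Toy 3-token vocabulary; both "models" ignore the prefix.
target = lambda prefix: [0.6, 0.3, 0.1]
draft = lambda prefix: [0.4, 0.4, 0.2]

rng = random.Random(0)
out = speculative_step(draft, target, prefix=[], k=4, rng=rng)
print(out)  # between 1 and k + 1 tokens
```

Each round emits at least one token and at most K + 1, so the large model never does worse than one token per pass. The resampling step is the part that makes the scheme lossless rather than an approximation.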

---

## What the Engineering Numbers Say

Google's implementation on Gemini reportedly shows a 2–3× throughput improvement at equivalent output quality. Meta's speculative decoding for Llama 3 70B reportedly achieved similar gains with no change in hardware.

The acceptance rate depends on how closely the draft model's distribution matches the verifier's. Rough ranges by workload:
- Code generation: 70–80% token acceptance rate
- Factual Q&A: 60–75%
- Creative writing: 50–65%

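Even at 50% acceptance, the net throughput gain is roughly 1.8×, assuming a short draft window and a draft model that is cheap relative to the verifier. Every verifier pass still emits at least one token, and partial acceptance raises the average to nearly two, which amortizes the large-model forward pass cost.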
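One way to sanity-check a figure like this is the usual expected-value estimate. It assumes each drafted token is accepted independently with probability `alpha`, that K tokens are drafted per round, and that one draft step costs a fixed fraction of a verifier pass. The values K = 4 and a cost ratio of 0.02 below are illustrative assumptions, not numbers from any deployment mentioned here.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier pass: accepted prefix plus one."""
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per round divided by round cost, in units of one verifier pass."""
    round_cost = 1.0 + k * draft_cost_ratio
    return expected_tokens_per_round(alpha, k) / round_cost


for alpha in (0.5, 0.65, 0.8):
    print(f"alpha={alpha:.2f}: ~{estimated_speedup(alpha, 4, 0.02):.2f}x")
```

With these assumptions, a 50% acceptance rate gives about 1.79×, in line with the figure above. The estimate is optimistic in two ways: real acceptances are correlated across positions, and verification is only "free" while the verifier pass stays memory-bound.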

---

## The Bigger Picture

Speculative decoding didn't require new hardware, new architectures, or new training. It reused existing smaller models as efficient drafters for larger ones. That's the kind of systems-level optimization that compounds: better draft models mean higher acceptance rates, which means more throughput gains from the same hardware.

The real implication is that inference efficiency is increasingly a systems engineering problem, not a hardware problem. You don't need the next GPU generation to double throughput — you need smarter scheduling of the compute you already have.

This is what the inference optimization wave of 2025–2026 is actually built on.

"Speculative Decoding: The Inference Trick That Quietly Fixed LLM Latency"
