LLM Inference Engineering: What Actually Happens Between "Submit" and Your Response

You hit Enter. The model responds. It looks instant — or close to it. But between those two moments, something extraordinary is happening at the hardware level. Most coverage treats this as a black box. It isn't.

## The Request Pipeline

When your query hits a modern LLM inference server, it doesn't go to a single GPU waiting patiently for your text. It enters a queue managed by an **inference engine**, such as NVIDIA TensorRT-LLM, vLLM, or Hugging Face TGI, which performs continuous batching across many simultaneous requests.

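> ⚡ A single A100 80GB GPU handles roughly 200–400 tokens per second for a 7B-parameter model, depending on batch size, sequence length, and precision. Scale to a 70B model, and the FP16 weights alone (about 140 GB) exceed one GPU's 80 GB, so you need multiple GPUs just to hold them in memory. GPT-4-class models are widely believed to need far larger multi-GPU deployments, though their serving configurations are not public.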

The inference engine's job is to maximize GPU utilization by batching requests from different users together. You're not getting a dedicated GPU. You're sharing compute across dozens of simultaneous sessions in real time.
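As a rough illustration of continuous batching, the toy simulation below admits waiting requests into free batch slots before every decode step and releases a slot the moment a request finishes. The `Request` class, `run_continuous_batching`, and the step-based timing are assumptions made for this sketch; real engines also schedule prefill work and track memory, which is omitted here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request: an id and how many tokens it still has to generate (>= 1)."""
    req_id: str
    tokens_left: int


def run_continuous_batching(requests, max_batch_size):
    """Simulate decode steps and return {req_id: step at which it finished}."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit queued requests into any free batch slots before each step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step += 1
        # One batched forward pass yields one token for every running request.
        for req in running:
            req.tokens_left -= 1
            if req.tokens_left == 0:
                finished_at[req.req_id] = step
        # Finished requests leave immediately, freeing their slots.
        running = [r for r in running if r.tokens_left > 0]
    return finished_at


demo = [Request("a", 2), Request("b", 5), Request("c", 3)]
print(run_continuous_batching(demo, max_batch_size=2))
```

In this run, request `c` takes over the slot that `a` frees after step 2 and finishes at step 5. With static batching it would have waited for `b` to finish first and completed at step 8.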

---

## Prefill vs. Decode

There are two computationally distinct phases that most users never think about:

**Prefill**: Your entire input prompt is processed in a single forward pass. This is highly parallelizable — all input tokens are computed simultaneously. For a short prompt, this is nearly instantaneous. For a 50,000-token document, the prefill phase alone can take seconds.

**Decode**: After prefill, the model generates one token at a time, autoregressively. Each token requires a full forward pass. This is why longer responses take longer — it's not streaming theater. It's actual sequential computation.
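The two phases can be sketched with a stand-in model that only counts forward passes. `ToyModel` and `generate` are hypothetical names, and the "prediction" is a placeholder rather than real inference; the point is the shape of the loop, assuming at least one generated token.

```python
class ToyModel:
    """Stand-in for a transformer; counts forward passes instead of doing real math."""

    def __init__(self):
        self.forward_passes = 0
        self.cached_tokens = []  # plays the role of the KV cache

    def forward(self, new_tokens):
        """Process new_tokens in one pass and return a placeholder next token."""
        self.forward_passes += 1
        self.cached_tokens.extend(new_tokens)
        # Placeholder prediction: a deterministic function of the context length.
        return len(self.cached_tokens) % 50


def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: the whole prompt goes through in a single forward pass,
    # which also produces the first output token.
    next_token = model.forward(prompt_tokens)
    output = [next_token]
    # Decode: one forward pass per further token, each fed only the newest token.
    while len(output) < max_new_tokens:
        next_token = model.forward([next_token])
        output.append(next_token)
    return output


model = ToyModel()
tokens = generate(model, prompt_tokens=list(range(1000)), max_new_tokens=400)
print(len(tokens), model.forward_passes)
```

The 1,000-token prompt costs one pass, while each generated token costs one pass of its own, so the number of passes tracks the response length rather than the prompt length.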

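A 400-token response means 400 separate forward passes. For a 70B model in FP16, each pass has to read roughly **140 gigabytes** of weights from GPU memory (70 billion parameters × 2 bytes). The memory bandwidth of an A100 80GB is about 2 TB/s, so even if a single GPU could hold the model, reading the weights once would take about 70 ms, capping a lone request at roughly 14 tokens per second. Decode is therefore bound by memory bandwidth rather than arithmetic, which is why engines shard the weights across GPUs and batch requests so that one read of the weights serves many users.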
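The arithmetic can be checked in a few lines. This is a back-of-the-envelope sketch that assumes FP16 weights, a round 2 TB/s of bandwidth, and that every weight byte is read exactly once per forward pass; it ignores KV cache reads, multi-GPU sharding, and batching. The function name is made up for this example.

```python
def decode_time_per_token_s(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on per-token latency if every weight byte is read once per pass."""
    return weight_bytes / bandwidth_bytes_per_s


params = 70e9                # 70B parameters
bytes_per_param = 2          # FP16
weight_bytes = params * bytes_per_param   # 1.4e11 bytes = 140 GB
bandwidth = 2e12             # 2 TB/s, the A100 figure quoted above

t = decode_time_per_token_s(weight_bytes, bandwidth)
print(f"{t * 1e3:.0f} ms per token, at most {1 / t:.1f} tokens/s for one unbatched request")
```

Because the weight read is shared by every request in a batch, the same 70 ms can produce one token for each of many users, which is where batching earns its keep.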

---

## The KV Cache Problem

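Every decoder layer in a transformer attends over all previous tokens. To avoid recomputing them on every step, the engine stores each token's key and value vectors: the **KV cache**. The cache grows linearly with context length and with the number of concurrent sequences. For long conversations, the KV cache can consume 30–50% of available GPU memory.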

> ⚡ This is one reason long-context requests cost more to serve. It's not arbitrary pricing; it's a real memory constraint. More context means more memory per request, which means fewer concurrent users per GPU.
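To get a feel for the numbers, the KV cache size can be estimated from the model shape: two tensors (keys and values) per layer, each holding `num_kv_heads * head_dim` values per token. The configuration below (80 layers, head dimension 128, and either 8 or 64 KV heads) is an assumed 70B-class shape chosen for illustration, not a figure from this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size; the leading 2 accounts for keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * seq_len * batch_size)


GIB = 2 ** 30

# Assumed 70B-class shape with grouped-query attention (8 KV heads), FP16 cache.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
# Same shape with one KV head per attention head (64 KV heads).
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)

print(f"GQA: {gqa / GIB:.2f} GiB per 4096-token sequence")
print(f"MHA: {mha / GIB:.2f} GiB per 4096-token sequence")
```

Under these assumptions a single 4,096-token sequence needs about 1.25 GiB with grouped-query attention and 10 GiB without it, and the total scales with every additional concurrent sequence.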

Modern inference systems use **PagedAttention** (pioneered by vLLM) to manage this more efficiently, treating KV cache like virtual memory in an OS. The analogy isn't superficial — the engineering problems are genuinely similar. Pages can be swapped, shared, and reused across requests.
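The core idea can be sketched as a block allocator: the cache is cut into fixed-size blocks, and each sequence keeps a table of the physical blocks it owns, much like a page table. This is a simplified illustration, not vLLM's implementation; `PagedKVAllocator` and its methods are hypothetical names, and block sharing and swapping are left out.

```python
class PagedKVAllocator:
    """Toy allocator mapping each sequence to a list of fixed-size block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real engine would preempt or swap")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("a")      # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("b")      # 5 tokens -> 1 block
print(alloc.block_tables, len(alloc.free_blocks))
alloc.free("a")
print(len(alloc.free_blocks))
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, instead of reserving contiguous memory for its maximum possible length up front.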

---

## Quantization and the Throughput-Quality Tradeoff

Serving at FP16 (16-bit) precision is expensive. Many production systems quantize weights to **INT8** or **INT4**, cutting the weight memory footprint by 2x or 4x respectively. The FP8 format, supported in hardware on NVIDIA's Hopper (H100) and Blackwell (B200) architectures, is increasingly common for large-scale deployments.

The tradeoff: INT4 quantization on a 70B model can degrade scores on benchmarks such as MMLU and HumanEval by roughly 2–5%, depending on the quantization method and calibration data. That is acceptable for chat, but potentially meaningful for code generation or multi-step reasoning.
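The memory side of this tradeoff is simple to compute. The sketch below counts weight bytes only and assumes an idealized bytes-per-parameter for each format; real quantized checkpoints also store scales and zero points, so actual sizes run somewhat higher. The table and function names are made up for this example.

```python
# Idealized storage cost per parameter, ignoring quantization metadata.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```

For a 70B model this gives 140 GB at FP16, 70 GB at FP8 or INT8, and 35 GB at INT4, which is the difference between needing several 80 GB GPUs and fitting the weights on one.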

---

## The Bigger Picture

When you get a coherent response in under a second, it's not because the model is simple. It's because engineers have built a sophisticated stack of batching algorithms, memory management systems, and hardware-optimized CUDA kernels that extract every teraflop from expensive silicon.

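The next time latency spikes during peak hours, you're probably not hitting a software bug. More likely, your request is queueing for a slot on GPUs that are already saturated. You're hitting the physical limits of GPU memory capacity and bandwidth, and the economics of serving intelligence at scale.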
