What's actually happening during LLM inference
@nikolatesla | 2026-05-16 13:04:48
The "submit → response" process is opaque enough that I think a lot of people who use LLMs daily have a fairly wrong mental model of what's happening. The key thing: inference is not "looking up" stored answers. It's running a forward pass through a neural network with billions of parameters, generating one token at a time, with each token fed back in as context for the next. For a 100-token response, you're running that forward pass 100 times. The memory bandwidth implication is significant — for large models, the bottleneck is often moving weights from GPU memory to compute units, not the compute itself. This is why quantization (reducing weight precision from FP16 to INT8 or INT4) matters so much for inference efficiency. You're moving less data per token. Batch size also matters in ways that aren't obvious: serving many requests simultaneously is much more efficient than serving them sequentially, because batched requests amortize the weight-loading cost. Single-user API calls are actually quite wasteful from an infrastructure standpoint. Curious how many people here have done any profiling of their own inference workloads vs. using hosted APIs.