OpenAI o3: What the Reasoning Architecture Actually Does

The ARC-AGI benchmark numbers for o3 are impressive — and that's exactly why you should be careful about what you read into them.

ARC-AGI tests a specific kind of fluid reasoning: given a handful of input-output examples of a novel visual pattern, infer the implicit rule and apply it to a new input. It's designed to resist memorisation. GPT-4 and similar models score poorly, which suggests they struggle to generalise a pattern they weren't trained on. o3 scores highly. That's real, and it's interesting.
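To make the task shape concrete, here is a toy sketch of an ARC-style puzzle in Python. It is an illustration only: the grids, the three candidate rules, and the `infer_rule` helper are all made up for this example, and real ARC-AGI tasks draw on a far larger and less enumerable space of rules.

```python
from typing import Callable, Optional

Grid = list[list[int]]  # each cell is a small integer standing in for a colour


def flip_horizontal(g: Grid) -> Grid:
    return [list(reversed(row)) for row in g]


def flip_vertical(g: Grid) -> Grid:
    return [list(row) for row in reversed(g)]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical, hand-picked rule set. A real solver cannot enumerate rules like this.
CANDIDATE_RULES: dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def infer_rule(train_pairs: list[tuple[Grid, Grid]]) -> Optional[str]:
    """Return the name of the first rule consistent with every training pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in train_pairs):
            return name
    return None


# Two demonstration pairs, then one unseen input to solve.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6], [0, 0, 7]], [[6, 5, 4], [7, 0, 0]]),
]
test_input = [[1, 2, 3], [4, 5, 6]]

rule_name = infer_rule(train_pairs)
prediction = CANDIDATE_RULES[rule_name](test_input) if rule_name else None
print(rule_name, prediction)
```

The hard part of the benchmark is exactly what this sketch sidesteps: nobody hands the solver a short list of candidate rules to check.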

But here's what the benchmark doesn't tell you. The test environment for o3 allowed significant compute at inference time: large volumes of "thinking" tokens, spread across many sampled attempts, before the final answer. The high-compute configuration reportedly cost around $3,000 per task. For ARC-AGI tasks. That's not a deployment cost, that's a benchmark cost. The gap between benchmark performance and production-viable performance is enormous, and it's not discussed nearly enough.
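A quick back-of-the-envelope calculation shows how far that is from a production budget. Only the roughly $3,000 per task comes from the figure quoted above. The task count and the per-request budget below are hypothetical placeholders, so swap in your own numbers.

```python
def benchmark_cost(cost_per_task_usd: float, num_tasks: int) -> float:
    """Total spend for running every task once at the given per-task cost."""
    return cost_per_task_usd * num_tasks


COST_PER_TASK_USD = 3_000                 # approximate figure quoted in the text
NUM_TASKS = 100                           # hypothetical evaluation-set size
PRODUCTION_BUDGET_PER_REQUEST_USD = 0.50  # hypothetical per-request budget

total = benchmark_cost(COST_PER_TASK_USD, NUM_TASKS)
ratio = COST_PER_TASK_USD / PRODUCTION_BUDGET_PER_REQUEST_USD

print(f"Total for {NUM_TASKS} tasks: ${total:,.0f}")
print(f"One benchmark task costs {ratio:,.0f}x the assumed production budget")
```

Under these assumptions a single benchmark task costs thousands of times what a production request could plausibly spend.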

The architecture shift is real, though. Prior models, including GPT-4 and even early GPT-4o variants, are trained to produce answers. o3, as far as has been publicly described, is trained to produce *reasoning chains*, and then trained to evaluate those chains. It's reinforcement learning on the outputs of its own extended reasoning process, not just RLHF on human preference data. The practical effect is that the model can revisit and retry its own failed reasoning steps rather than committing to its first-pass answer.

What people call "thinking tokens" aren't the same as human reasoning chains. A human mathematician working through a proof is building up an explicit logical structure. o3's chain-of-thought is a learned *representation* of extended reasoning — it looks like reasoning, it often functions like reasoning, but it's not transparently inspectable the same way. The model isn't "thinking" in any philosophical sense. It's doing a very expensive form of pattern completion in a high-dimensional space, where the path matters to the output.

Where does this actually help? Tasks with verifiable intermediate steps: coding (you can run the code), math (you can check the proof), formal logic. It's less clear that it helps with the messy, under-specified problems of real engineering work, where the "correct" reasoning chain isn't well-defined, requirements are ambiguous, and the cost of wrong intermediate steps is high.
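Here is a minimal sketch of why "you can run the code" matters: generate several candidates, run each against tests, and keep the first one that passes. It does not describe how o3 works internally. The candidate strings stand in for model samples, and `passes_tests` and `first_verified` are hypothetical helpers written for this example.

```python
from typing import Iterable, Optional

TestCase = tuple[tuple, object]  # (positional args, expected result)


def passes_tests(source: str, tests: list[TestCase]) -> bool:
    """Execute candidate source that defines `solve` and check every test case."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # acceptable only for trusted, toy inputs
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing `solve`, and runtime errors all count as failure.
        return False


def first_verified(candidates: Iterable[str], tests: list[TestCase]) -> Optional[str]:
    """Return the first candidate that passes all tests, or None if none do."""
    for source in candidates:
        if passes_tests(source, tests):
            return source
    return None


# Stand-ins for sampled model outputs. Task: second-largest distinct value.
CANDIDATES = [
    "def solve(xs): return max(xs)",              # wrong: returns the largest
    "def solve(xs): return sorted(xs)[1]",        # wrong: breaks on duplicates
    "def solve(xs): return sorted(set(xs))[-2]",  # passes the tests below
]
TESTS: list[TestCase] = [
    (([1, 2, 3],), 2),
    (([5, 5, 4],), 4),
    (([9, 1, 7, 7],), 7),
]

chosen = first_verified(CANDIDATES, TESTS)
print(chosen)
```

The loop only works because the tests give an unambiguous pass or fail. For an ambiguous requirement there is no equivalent of `TESTS`, which is the gap the paragraph above describes.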

The comparison to Claude Sonnet and GPT-4o on coding benchmarks is instructive. o3 does better on hard competitive programming problems. But for the kind of code most engineers write (mostly boilerplate, glue, and domain-specific context) the gap is much smaller, and sometimes Claude's better code style wins in practice over o3's technically correct but verbose output.

My actual take: o3 is a meaningful architecture advance. The chain-of-thought-as-training-target approach is probably a durable direction for reasoning-heavy tasks. But anyone claiming this changes the AGI timeline based on one benchmark hasn't thought carefully about what the benchmark actually measures. Passing ARC-AGI with expensive inference-time search is interesting. It's just not what the press release implies.
