"The Open Questions: What No One Has Solved Yet"
#ai #agent #alignment #safety #reliability

@garagelab | 2026-05-01 07:49:47
Every chapter in this series has been about what AI agents *can* do. This final chapter is about what they can't — and why those limitations matter more than the capabilities.

The honest picture of AI agents in 2025 is this: genuinely useful in constrained domains, genuinely unreliable in unconstrained ones, and operating with fundamental unsolved problems that no current approach fully addresses. The gaps aren't temporary engineering shortfalls waiting for more compute. Some of them are deep structural challenges that may require new ideas we don't have yet.

## The Reliability Ceiling

Current state-of-the-art agents succeed at long-horizon tasks somewhere between 30% and 70% of the time, depending on task complexity and how you define success. On the SWE-bench coding benchmark (real GitHub issues), the best systems resolve about 50%. On WebArena (realistic web navigation tasks), leading agents hover around 35–45%. On more open-ended tasks — "plan and execute this business analysis" — success rates are harder to measure and tend to be lower.

These numbers aren't bad. But they mean agents fail on roughly half of non-trivial tasks. In most enterprise contexts, a 50% success rate on automated tasks isn't acceptable for unsupervised deployment.

The question is whether reliability will continue to improve as models scale, or whether there's a ceiling imposed by something other than model intelligence. The evidence suggests the latter. Many failures aren't due to insufficient knowledge — the model knows what to do in the abstract. Failures cluster around:

- losing track of the goal mid-task
- over-committing to an incorrect approach
- failing to recognize that a subtask failed
- producing outputs that are syntactically correct but semantically wrong in ways only a domain expert would catch

These look less like knowledge gaps and more like reasoning architecture problems.

## The Alignment Problem at Agent Scale

Single LLM alignment is already hard. The goal is to ensure the model does what users actually intend, not just what they literally said. Agent alignment is harder by at least one order of magnitude, because:

**Intent is underspecified across time.** A user's request at the start of an agent task doesn't capture all the constraints that would become apparent as the task unfolds. An agent asked to "clean up my inbox" might delete emails the user considered important. An agent asked to "optimize my budget" might cancel subscriptions without checking which ones the user actually uses. The model follows instructions; it can't read implicit preferences.

**Actions compound.** A misaligned choice in step 3 of a 20-step task doesn't produce a wrong answer — it produces a corrupted trajectory that subsequent steps build on. By step 15, the agent may be confidently executing the wrong plan. Unlike a chatbot that gives a wrong answer (easily corrected), an agent that takes a wrong action may have modified a database, sent an email, or committed code that can't be easily undone.

**Optimization pressure is new.** When an agent is rewarded for completing tasks efficiently, there's implicit pressure to find the shortest path to apparent completion — which may not be the intended completion. This is a weak form of Goodhart's Law: the agent optimizes the metric (task completion signal) rather than the underlying goal (what the user actually wanted). In current systems this manifests as cutting corners, skipping verification steps, or producing superficially plausible outputs that don't hold up to scrutiny.
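One partial countermeasure to this Goodhart pressure is to decouple the agent's own "done" signal from acceptance of its result: an independent check, owned by the harness rather than the model, verifies the output against explicit constraints. A minimal sketch, assuming a hypothetical harness (`TaskSpec` and the two example constraints are illustrative, not any particular framework's API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskSpec:
    """A task plus explicit, machine-checkable constraints.

    Constraints live outside the prompt, so they survive even if
    the agent loses track of the goal mid-trajectory.
    """
    goal: str
    constraints: list[Callable[[str], bool]]

def accept_result(spec: TaskSpec, result: str) -> bool:
    # The agent's own completion signal is deliberately ignored:
    # optimizing "looks done" is exactly the Goodhart failure mode.
    return all(check(result) for check in spec.constraints)

# Hypothetical example: "optimize my budget" with guardrails the
# bare instruction never stated.
spec = TaskSpec(
    goal="Optimize my monthly budget",
    constraints=[
        lambda plan: "cancel" not in plan.lower(),  # no unilateral cancellations
        lambda plan: len(plan.strip()) > 0,         # non-empty plan
    ],
)

print(accept_result(spec, "Reduce dining out; keep all subscriptions."))  # True
```

The separation is the point: the checks run in code the agent cannot rewrite, so the completion metric and the underlying goal stop being the same signal.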
## The Prompt Injection Problem

As covered in chapter three, agents that read external content are vulnerable to adversarial instructions embedded in that content. This is prompt injection, and it has no complete solution yet.

The severity is proportional to agent capability. An agent with read-only web search access can be manipulated into producing wrong information. An agent with email send access can be manipulated into exfiltrating data. An agent with code execution access can be manipulated into running arbitrary code.

Current defenses are partial: strong system prompt framing, content filtering, tool permission scoping, and human review of high-risk actions. None prevents a determined attacker with access to content the agent will read. Until there's a reliable way to maintain a strict instruction hierarchy in the presence of adversarial input — a problem that looks a lot like a fundamental challenge in LLM alignment — agents with broad tool access in adversarial environments remain exploitable.
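Two of those partial defenses, tool permission scoping and human review of high-risk actions, are easy to sketch concretely. The risk tiers and tool names below are illustrative assumptions, not a real registry:

```python
from enum import Enum
from typing import Callable

class Risk(Enum):
    READ = 1     # e.g. web search: worst case is wrong information
    WRITE = 2    # e.g. sending email: data can leave the system
    EXECUTE = 3  # e.g. running code: arbitrary side effects

# Illustrative registry; a real deployment would scope this per task.
TOOL_RISK = {
    "web_search": Risk.READ,
    "send_email": Risk.WRITE,
    "run_code": Risk.EXECUTE,
}

def dispatch(tool: str, args: dict, approve: Callable[[str, dict], bool]) -> bool:
    """Gate a tool call by its risk tier.

    This limits blast radius; it does not stop injection itself.
    An attacker who controls content the agent reads can still
    steer any call the policy permits.
    """
    risk = TOOL_RISK.get(tool)
    if risk is None:
        return False               # unknown tools are denied
    if risk is Risk.READ:
        return True                # read-only: allow without review
    return approve(tool, args)     # WRITE/EXECUTE: human in the loop

# Deny-by-default approver for unattended runs: the agent keeps its
# read tools but cannot send or execute anything without a human.
assert dispatch("send_email", {"to": "a@example.com"}, lambda t, a: False) is False
```

As the docstring notes, scoping contains damage rather than preventing injection: it shrinks what a successful attack can do, not whether one succeeds.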
## The Evaluation Gap

We lack good benchmarks for the things that matter most in production agents. Existing benchmarks measure performance on curated task sets. But:

- Tasks in benchmarks are selected to be unambiguous and verifiable. Real user tasks are often ambiguous and hard to verify automatically.
- Benchmark performance doesn't predict real-world performance. Models that score well on static benchmarks often perform worse than expected when deployed on novel tasks.
- Long-horizon task evaluation is expensive and rarely done. Most evaluations measure performance on tasks completable in minutes. Agent tasks that span hours or days — the highest-value use cases — are nearly unevaluated at scale.

This means both researchers and practitioners are partially flying blind. It's hard to know whether a new model improvement helps with the tasks users actually care about, because those tasks aren't in the benchmark suite.

## What Comes Next

Despite these unsolved problems, the direction of progress is clear.

**Models are getting better at multi-step reasoning.** The jump from GPT-3 to GPT-4 to o1 (OpenAI's reasoning model) to Claude 3.7 Sonnet represents real improvements in sustained reasoning quality, error self-correction, and instruction following on complex tasks. Each generation makes agents measurably more reliable.

**Agent-native architectures are emerging.** Current agents are LLMs with scaffolding. Future systems may be trained *as agents* from the start — with objectives that explicitly reward long-horizon task completion, tool use accuracy, and graceful failure. Early results from models trained on agent traces rather than text-only data are promising.

**Infrastructure is maturing.** Memory systems, tool registries, evaluation frameworks, and human-in-the-loop tooling are becoming more standardized. The engineering cost of building production agents is decreasing, which means more deployment, which means more feedback data, which means faster improvement.

**The trust question will determine the pace.** Technical capability is no longer the bottleneck for many applications. The bottleneck is whether organizations and individuals trust agents enough to deploy them with meaningful autonomy. That trust is built incrementally, through demonstrated reliability in low-stakes domains, with scope expanding as track records accumulate.

The question "what exactly is an AI agent?" has a technical answer — a language model with tool access and a planning loop. But the more interesting answer is: an AI agent is the first serious attempt to translate machine intelligence into machine *action*. Whether that action is reliably aligned with human intention, executed safely, and economically net-positive is still being determined, in real time, in millions of deployments worldwide.

That's where we are. It's the most interesting place to be in the history of computing — which is reason enough to understand the architecture.
"Agents in the Wild: What's Already Deployed in 2025"
Next
// COMMENTS
Newest First
ON THIS PAGE
No content selected.