"Agents in the Wild: What's Already Deployed in 2025"
#ai #agent #devin #cursor #openai
@garagelab | 2026-05-01 07:49:47
Enough theory. What are AI agents actually doing in 2025, in production, at scale? The answer is more varied than the headlines suggest — some deployments are genuinely transformative, others are polished demos dressed up as products, and a few are quietly reshaping entire industries with almost no public fanfare. This is the ground truth.

## Software Development: The Clearest Win

Coding is the domain where AI agents have most unambiguously delivered on their promise. The reasons are structural: code has objective correctness criteria (does it run? do the tests pass?), the environment is fully observable (files, errors, and test results are all readable), and the feedback loop is fast.

**Cursor** is the most widely used AI coding agent in practice. Rather than a fully autonomous agent, it follows a co-pilot model — the developer stays in the loop, accepting or rejecting suggestions. But its "Composer" feature allows multi-file edits from a single instruction, which is genuinely agentic behavior. By early 2025, Cursor reported over 100,000 paying users, many of whom describe it as changing how they work, not just speeding it up.

**Devin** (Cognition AI) operates at the more autonomous end: given a GitHub issue or a task description, it opens a browser, reads documentation, writes code, runs tests, iterates on failures, and opens a pull request. On the SWE-bench benchmark, Devin resolved 13.9% of issues — the first public system to exceed 10%. Subsequent systems pushed this above 40% by mid-2025. Not a replacement for senior engineers, but something new.

**GitHub Copilot Workspace** (Microsoft/GitHub) represents the enterprise-scale deployment: coding agents embedded directly into the GitHub interface, with access to the entire repository context, CI/CD systems, and code review workflows. Millions of developers interact with it daily, even if most interactions look like intelligent autocomplete rather than autonomous action.

## Research: Deep Reading at Scale

Research tasks — finding papers, synthesizing literature, generating hypotheses — were a natural early use case. The workflow maps cleanly onto the agent pattern: search, read, reason, synthesize.

**OpenAI Deep Research** (launched early 2025) demonstrated the current ceiling. Given a complex research question, it autonomously searches the web for tens of minutes, reads dozens of sources, and produces structured reports with citations. Benchmarks on professional research tasks showed it outperforming human researchers on some tasks and falling short on others — particularly where domain expertise and judgment about source credibility were critical.

Academic institutions are deploying research agents internally for literature review, with careful human oversight of conclusions. The value is speed: a task that previously took two weeks of a research assistant's time can produce a working draft in 30 minutes. The quality still requires expert review, but the baseline is raised.

## Customer Service: The Boring Success Story

AI agents in customer service aren't glamorous, but they are where the economic impact is largest and most measurable. Every major telecom, bank, airline, and e-commerce company has either deployed or is piloting AI agents for tier-1 support.

The architecture is consistent: a natural language understanding layer routes queries to specialized agents (order status, refunds, technical support, account changes), each with tool access to the relevant backend systems. Simple queries are resolved fully automatically; complex or escalated cases route to human agents with the full context already assembled. A minimal sketch of this routing pattern follows.
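To make that pattern concrete, here is a small Python sketch of a route-then-escalate layer. It is illustrative only, not any vendor's implementation: the intent labels, the keyword-based `classify_intent` stand-in, the `CONFIDENCE_THRESHOLD` value, and the `lookup_order` backend call are all assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass, field

@dataclass
class Resolution:
    handled_by: str                                # "agent" or "human"
    reply: str
    context: dict = field(default_factory=dict)    # context handed to a human on escalation

def classify_intent(query: str) -> tuple[str, float]:
    """Stand-in for the NLU layer: returns (intent, confidence).

    In production this would be an LLM or classifier call; here it is a
    trivial keyword heuristic so the sketch runs on its own."""
    lowered = query.lower()
    if "refund" in lowered:
        return "refund", 0.90
    if "order" in lowered:
        return "order_status", 0.85
    if "password" in lowered or "login" in lowered:
        return "account_change", 0.80
    return "tech_support", 0.40        # low confidence on anything ambiguous

def lookup_order(query: str) -> str:
    # Placeholder for a real backend call (order-management API, CRM, etc.).
    return "Your order shipped yesterday and should arrive within two days."

# Each intent maps to a tool with access to the relevant backend system.
TOOLBOX = {
    "order_status": lookup_order,
    # refund, tech_support, account_change tools would be registered here
}

CONFIDENCE_THRESHOLD = 0.75            # assumed cutoff for full automation

def handle(query: str) -> Resolution:
    intent, confidence = classify_intent(query)
    tool = TOOLBOX.get(intent)
    # Confidently classified queries with a registered tool resolve automatically;
    # everything else escalates to a human with the assembled context attached.
    if tool is not None and confidence >= CONFIDENCE_THRESHOLD:
        return Resolution("agent", tool(query), {"intent": intent})
    return Resolution(
        "human",
        "Transferring you to a support specialist.",
        {"intent": intent, "confidence": confidence, "query": query},
    )

if __name__ == "__main__":
    print(handle("Where is my order #1234?"))          # resolved by the agent
    print(handle("My app keeps crashing on startup"))  # escalated to a human
```

The design choice worth noticing is that the escalation path carries the intent, confidence, and original query with it, so the human picks up with context rather than a blank slate — the same property the deployments described below depend on.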
Klarna (fintech) reported in early 2024 that its AI agent was handling two-thirds of customer support conversations, the equivalent of 700 human agents. Response time dropped from 11 minutes to under 2 minutes, and error rates were comparable to human agents. Whether or not those numbers were exactly right, the directional story was broadly validated by similar reports from other companies.

The caveats: agents fail badly on edge cases, get stuck in loops on ambiguous requests, and produce confident-sounding wrong answers that damage trust more than silence would. The good deployments pair automation with rapid human escalation paths and continuous monitoring for failure patterns.

## Personal Productivity: The Emerging Wave

The most speculative but most-watched category. Can an AI agent be a genuine personal assistant — managing your calendar, filtering email, doing research, drafting communications, coordinating between services — not occasionally, but reliably, daily?

**Lindy** and **Embra** are building in this direction, with agent products that integrate email, calendar, CRM, and knowledge bases. Early adopters report meaningful time savings on administrative overhead. The reliability isn't there yet for high-stakes tasks (agents still misunderstand intent, take wrong actions, and require correction), but the trajectory is toward use by default, not use with caution.

OpenAI's **Operator** (a GPT-4-based web agent) and Anthropic's **Computer Use** feature (Claude controlling a real computer desktop) represent the frontier: agents that use computers the way humans do, navigating interfaces that weren't designed for API access. The current versions are slow, prone to getting stuck on unexpected UI states, and require explicit supervision. But they're getting faster and more reliable with each model generation.

## What Separates Deployed Systems from Lab Demos

Looking across successful deployments, a pattern emerges in what makes agents work in practice versus fall apart:

**Constrained scope beats general capability.** Every successful deployment targets a specific workflow with defined inputs, outputs, and failure modes. The customer service agent that handles 20 well-defined query types reliably beats the general-purpose agent that handles 100 query types poorly.

**Observable success criteria.** Software agents can run tests and know whether they passed. Research agents can check whether a citation exists. Agents operating in domains without objective success criteria — creative tasks, strategic advice, interpersonal communication — have no way to verify their own output quality.

**Graceful degradation.** The best production agents are designed to fail predictably. When they're uncertain, they say so. When they can't complete a task, they transfer to a human with full context, not a blank slate. The hardest engineering in agent deployment is often designing the failure modes.

The gap between "impressive in a demo" and "reliable in production" remains large. Bridging it is the current central challenge in applied AI — and it's inseparable from the deeper questions the final chapter addresses: what agents still can't do, and what that means.