January to December 2025. Twelve months of running AI agents in production across different clients, different industries, different scales. This is what we learned.
The first month is lies
Every agent works in the first month. User volume is low. Edge cases haven't surfaced. The model is performing well because your small test set happens to match the training distribution. The metrics look great.
Then month two happens. Traffic scales. Users find creative ways to break your system. The model encounters inputs it handles poorly. The metrics that looked great were measuring the wrong things.
Don't trust any metric until month three. Build your monitoring for the long game.
Prompt engineering is temporary
We spent weeks crafting perfect system prompts. Detailed instructions, few-shot examples, explicit constraints. They worked well until the model provider updated their model.
Between January and December, we experienced four model updates across two providers that meaningfully changed how our prompts performed. Instructions that produced structured output suddenly produced conversational output. Temperature settings that felt consistent started producing erratic results.
Prompts are configuration, not code. They need the same versioning, testing, and rollback infrastructure as any other config. Pin your model version. Test prompt changes against a regression suite before deploying. Have a rollback plan.
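Treating prompts as config might look like this minimal sketch. All names here (`PromptConfig`, `SUPPORT_AGENT_V3`, the model snapshot string) are hypothetical stand-ins, not anyone's actual production config:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptConfig:
    version: str        # bump on every prompt change, like any config
    model: str          # pinned model snapshot, never an alias like "latest"
    temperature: float
    system_prompt: str

# Illustrative only: the point is that the prompt lives in versioned
# config with an explicit, pinned model snapshot.
SUPPORT_AGENT_V3 = PromptConfig(
    version="3.2.0",
    model="gpt-4o-2024-08-06",  # a dated snapshot, not a moving alias
    temperature=0.2,
    system_prompt="You are a support agent. Respond in JSON only.",
)

def rollback(current: PromptConfig, previous: PromptConfig) -> PromptConfig:
    """Rolling back a bad prompt change is just swapping config objects."""
    return previous
```

With prompts in this shape, the regression suite runs against a `PromptConfig` the same way it would against any other config change.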
Retries solve most failures
Most agent failures are transient. Rate limits, timeouts, network glitches, model provider hiccups. A simple retry with exponential backoff and jitter resolves the vast majority.
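A retry wrapper with exponential backoff and full jitter is a few lines. This is a generic sketch; which exception types count as transient depends on your provider's client library:

```python
import random
import time

# Assumption: these are the transient error types for your stack;
# in practice you'd add your provider's 429/5xx exceptions.
TRANSIENT = (TimeoutError, ConnectionError)

def with_retry(call, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TRANSIENT:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # full jitter: sleep a random amount up to the capped backoff
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

The jitter matters: without it, every client that failed at the same moment retries at the same moment, and you hammer the provider in synchronized waves.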
We tracked failure root causes across all our agent deployments for a full year:
- Transient API errors (resolved by retry): 68%
- Model output quality issues: 14%
- Tool execution failures: 11%
- Genuine bugs in our code: 5%
- Unrecoverable errors: 2%
The first category needs retries. The second needs output validation and fallback prompts. The third needs circuit breakers. Build these three mechanisms and you've addressed 93% of failures.
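Of the three mechanisms, the circuit breaker is the one people most often skip. A minimal version for tool execution, sketched here with illustrative defaults, just needs three states: closed, open, and a half-open probe after a cooldown:

```python
import time

class CircuitBreaker:
    """Stop calling a failing tool after repeated errors; probe again after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_seconds=60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # half-open: allow a probe call once the cooldown has elapsed
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Wrap each tool call in `allow()` / `record_success()` / `record_failure()`, and a flaky tool degrades one capability instead of stalling every agent loop behind it.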
Cost is a design constraint
The most expensive agent we deployed cost $0.47 per interaction. The cheapest cost $0.003. The difference wasn't the model, it was the architecture.
The cheap agent used a small model for classification, cached aggressively, and only called the expensive model for generation. The expensive agent used GPT-4 for every step because "we'll optimize later." Later never came until the bill did.
Set a cost budget per interaction on day one. Treat it like a latency SLA. Design your architecture to hit it.
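Enforcing the budget can be as simple as a per-interaction accumulator that fails loudly when crossed. The prices below are placeholders, not real rates:

```python
class CostBudget:
    """Per-interaction cost budget, enforced like a latency SLA."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, tokens_in: int, tokens_out: int,
               price_in_per_1k: float, price_out_per_1k: float):
        # Accumulate cost for one model call; rates are passed in,
        # since they differ per model and change over time.
        self.spent_usd += (tokens_in / 1000) * price_in_per_1k
        self.spent_usd += (tokens_out / 1000) * price_out_per_1k
        if self.spent_usd > self.limit_usd:
            raise RuntimeError(
                f"interaction over budget: ${self.spent_usd:.4f} > ${self.limit_usd}"
            )
```

A hard raise is deliberate: a budget that only logs a warning is an aspirational target, and aspirational targets are exactly what let "we'll optimize later" survive until the bill.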
Users don't care about your AI
This was the humbling one. Users don't care that it's AI. They care that it works. When our agents were fast, accurate, and helpful, users never mentioned the AI. When they were slow, wrong, or confusing, users complained about the product, not the technology.
The best compliment an AI agent can receive is no comment at all.
Evals are your immune system
We started with manual testing. Then we added a regression suite. Then automated evaluation with LLM-as-judge. Then production monitoring with drift detection.
Each layer caught failures the previous layer missed. Manual testing caught obvious bugs. Regression caught prompt regressions. LLM-as-judge caught quality degradation. Production monitoring caught distribution drift.
If you're deploying agents without at least three of these layers, you're flying blind. The question isn't whether your agent will degrade. It's whether you'll detect it before your users do.
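The regression layer is the cheapest of the four to build: golden inputs plus a check each output must pass. Everything here is a hypothetical sketch; `run_agent` stands in for whatever your agent's entry point is:

```python
# Golden cases: each pairs an input with a check on the agent's output.
GOLDEN_CASES = [
    {"input": "cancel my subscription",
     "check": lambda out: out.get("intent") == "cancellation"},
    {"input": "what's your refund policy?",
     "check": lambda out: out.get("intent") == "faq"},
]

def run_regression(run_agent) -> list[str]:
    """Run every golden case; return the inputs whose output fails its check."""
    failures = []
    for case in GOLDEN_CASES:
        output = run_agent(case["input"])
        if not case["check"](output):
            failures.append(case["input"])
    return failures
```

Run this in CI on every prompt or model change. It will not catch subtle quality drift, which is what the LLM-as-judge and production-monitoring layers are for, but it catches the blunt regressions before they ship.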
The architecture that kept emerging
After twelve months, a consistent shape emerged across all our deployments:
- Input validation layer: classify, sanitize, scope-check before any model call
- Router: select the right model and tool set based on the input class
- Execution loop: model call, tool call, model call, with retry and circuit breaking per step
- Output validation: check the response against schema expectations before returning
- Observability: structured logs, cost tracking, quality scoring on every response
- Feedback loop: user signals flow back into evaluation and prompt improvement
Not novel. It's a pipeline. But after twelve months and hundreds of failure modes, we kept arriving at the same shape. The novel architectures always simplified to this under production pressure.
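The shape really is just a pipeline. In this sketch, every component name (`classify`, `select_route`, `execute`, `validate`, `log_interaction`) is a hypothetical stand-in for your own implementation of the corresponding layer:

```python
def handle(user_input, classify, select_route, execute, validate, log_interaction):
    """One request through the pipeline: validate in, route, execute, validate out, log."""
    input_class = classify(user_input)        # input validation layer
    route = select_route(input_class)         # router: model + tool set for this class
    response = execute(route, user_input)     # execution loop (retries, circuit breaking inside)
    if not validate(response):                # output validation against schema expectations
        response = {"error": "validation_failed"}
    log_interaction(user_input, route, response)  # observability on every response
    return response
```

Passing the layers in as functions is one way to keep them independently testable; in production they would typically be objects with their own config, metrics, and feedback hooks.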
What we'd do differently on day one
Start with observability. Not after the first incident. Before the first deployment.
Pin model versions. Never point at latest. The convenience isn't worth the instability.
Build the evaluation suite before the agent. Define what "good" looks like before you build the thing that's supposed to be good. This feels backwards. It's not.
Set cost and latency budgets immediately. Not aspirational targets. Hard constraints that shape architecture decisions from the first commit.
Plan for the model to change under you. Because it will. Monthly. Your architecture needs to absorb model-level changes without architecture-level rewrites.
Most of the hard problems aren't AI problems. They're engineering problems. And engineering problems have engineering solutions, if you build for them from the start.
Starting your agent journey? We can save you a few of these lessons.