A Vercel engineer recently shared an internal talk that should be required reading for every engineering leader: "Agent Responsibly." The core observation is brutally simple. Agent-generated code is dangerously convincing.
It comes with polished PR descriptions. It passes static analysis. It follows repository conventions. It includes test coverage. On the surface, it's indistinguishable from code written by a skilled engineer.
But the agent doesn't know that your Redis instance is at 85% capacity. It doesn't know that the feature flag you're testing against will change the load profile of three downstream services. It doesn't know about the implicit constraint that prevents concurrent writes to that particular table.
Green CI is no longer proof that code is safe to ship.
The false confidence problem
Traditional code review works because humans bring context that isn't in the code. They know the production environment, the traffic patterns, the failure modes, the team dynamics. A senior engineer looks at a PR and thinks "this will cause a thundering herd on the notification service," not because the code is wrong, but because they know what happens at scale.
Agents don't have this context. And the code they produce is so clean that reviewers are tempted to approve faster. The PR looks great. The tests pass. The description is thorough. What could go wrong?
Everything that isn't covered by tests.
What replaces review
The answer isn't to stop using agents or to add more human review. Agent-generated code volume will only increase, and human review doesn't scale with it. The answer is to make the deployment infrastructure itself responsible for safety.
Canary deployments by default. Every change rolls out to a small percentage of traffic first. If error rates, latency, or business metrics degrade, automatic rollback. The system doesn't rely on a human watching a dashboard. This catches the production-context failures that no test suite covers.
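The rollback decision described above can be sketched as a simple gate that compares the canary slice against baseline traffic. This is a minimal illustration; the metric names, thresholds, and data sources are hypothetical, not any specific vendor's API.

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    error_rate: float       # fraction of requests failing on this slice
    p99_latency_ms: float   # tail latency observed on this slice
    conversion_rate: float  # a business metric the team cares about

def should_rollback(canary: CanaryMetrics, baseline: CanaryMetrics,
                    max_error_increase: float = 0.001,
                    max_latency_increase_ms: float = 50.0,
                    max_conversion_drop: float = 0.01) -> bool:
    """Compare the canary slice to baseline traffic; degradation beyond
    any configured tolerance triggers automatic rollback, no human
    dashboard-watching required."""
    if canary.error_rate - baseline.error_rate > max_error_increase:
        return True
    if canary.p99_latency_ms - baseline.p99_latency_ms > max_latency_increase_ms:
        return True
    if baseline.conversion_rate - canary.conversion_rate > max_conversion_drop:
        return True
    return False
```

Note that the gate checks business metrics, not just errors and latency: an agent-generated change can be technically healthy while quietly breaking a funnel.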
Continuous validation. The infrastructure tests itself continuously, not just at deploy time. Load tests, chaos experiments, and invariant checks run on a schedule. When a subtle regression appears three days after deploy, validation catches it before users report it.
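One way to structure scheduled invariant checks is a small registry that a cron job or scheduler runs every cycle. A minimal sketch; the check name, threshold, and `get_redis_memory` helper (stubbed here) are invented for illustration.

```python
from typing import Callable

# Registry of named invariant checks; each returns True when healthy.
INVARIANTS: dict[str, Callable[[], bool]] = {}

def invariant(name: str):
    """Register a check to run on every validation cycle."""
    def register(fn: Callable[[], bool]) -> Callable[[], bool]:
        INVARIANTS[name] = fn
        return fn
    return register

def get_redis_memory() -> tuple[int, int]:
    # Stand-in for a real metrics query; returns (used_bytes, total_bytes).
    return 7_000_000_000, 10_000_000_000

@invariant("redis_capacity_below_90pct")
def check_redis() -> bool:
    used, total = get_redis_memory()
    return used / total < 0.90

def run_validation_cycle() -> list[str]:
    """Return the names of failed invariants; alerting hangs off this."""
    return [name for name, check in INVARIANTS.items() if not check()]
```

Because the checks run on a schedule rather than only at deploy time, a regression that develops slowly (a cache filling up, a queue backing up) is caught regardless of which deploy caused it.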
Executable guardrails. Encode operational knowledge as runnable tools, not documentation. A "safe rollout" isn't a wiki page explaining feature flags. It's a tool that creates the flag, generates a rollout plan with rollback conditions, and specifies verification criteria. When guardrails are executable, agents follow them autonomously.
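The "safe rollout" tool above might look something like the following. Everything here is a hypothetical sketch: the field names, stage percentages, rollback conditions, and the `create_feature_flag` call (stubbed) are illustrative, not a real flag service's API.

```python
from dataclasses import dataclass

@dataclass
class RolloutPlan:
    flag_name: str
    stages: list[int]                 # traffic percentages per stage
    rollback_conditions: list[str]    # machine-checkable abort criteria
    verification: list[str]           # what must hold before advancing

def create_feature_flag(name: str, enabled_percent: int) -> None:
    # Stand-in for a real feature-flag service call.
    print(f"created flag {name} at {enabled_percent}%")

def plan_safe_rollout(flag_name: str) -> RolloutPlan:
    """Create the flag and return a concrete plan an agent can follow
    autonomously, instead of a wiki page it can't execute."""
    create_feature_flag(flag_name, enabled_percent=0)
    return RolloutPlan(
        flag_name=flag_name,
        stages=[1, 5, 25, 100],
        rollback_conditions=[
            "error_rate_delta > 0.1%",
            "p99_latency_delta > 50ms",
        ],
        verification=[
            "canary metrics stable for 30 minutes per stage",
            "audit agents report no invariant violations",
        ],
    )
```

The point of the structure is that every field is checkable by a machine: an agent can create the flag, advance the stages, and abort on the listed conditions without a human interpreting prose.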
Read-only audit agents. Deploy specialized agents that continuously verify system invariants in production. They check that the assumptions made by generative agents still hold: cache sizes are within bounds, queue depths are normal, cross-service latency is stable. They don't write code. They read production state and raise alerts.
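An audit agent's inner loop can be as simple as a set of read-only checks against live metrics. A minimal sketch, with invented invariants and thresholds; the `read_metric` callable stands in for whatever observability client the team actually uses.

```python
from typing import Callable

def audit_once(read_metric: Callable[[str], float]) -> list[str]:
    """Run all invariant checks against production state and return
    alerts. The agent only reads; it never writes code or mutates
    anything."""
    alerts = []
    if read_metric("cache_used_bytes") / read_metric("cache_max_bytes") > 0.85:
        alerts.append("cache size out of bounds")
    if read_metric("queue_depth") > 10_000:
        alerts.append("queue depth abnormal")
    if read_metric("cross_service_p99_ms") > 250:
        alerts.append("cross-service latency unstable")
    return alerts
```

Keeping the audit agent read-only is the safety property: even if its checks are wrong, the worst outcome is a noisy alert, never a bad write.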
Leverage versus rely
There's a critical distinction between leveraging agents and relying on them.
Relying means assuming that if the agent wrote it and the tests pass, it's ready to ship. The engineer never builds a mental model of the change. The result is massive PRs full of hidden assumptions that nobody truly understands.
Leveraging means using agents to iterate quickly while maintaining complete ownership of the output. You know how the code behaves under load. You've traced the happy path and the failure modes. You could explain it in a production incident.
The litmus test: would you be comfortable owning a production incident tied to this code? If yes, ship it. If no, you have more work to do.
What we've learned
Building agent-generated systems for production clients has taught us that verification is more important than generation. Any agent can write code. The hard part is proving it's safe.
Our deployment pipeline treats every agent-generated change with the same scrutiny as a first-week junior developer's: automated tests, canary rollout, continuous monitoring, and automatic rollback. The only difference is that it runs 100x faster than a human review cycle.
Trust isn't built by making agents smarter. It's built by making infrastructure rigorous.
We build deployment infrastructure where agent-generated code is safe by default. If your team is shipping agent output but losing sleep over production risk, we should talk.