Issue #10 · October 30, 2025 · agents · 6 min read

Production Patterns That Survive

"Roughly 80% of the effort is spent not on the agent's core intelligence, but on the infrastructure, security, and validation needed to make it reliable and safe."

"Roughly 80% of the effort is spent not on the agent's core intelligence, but on the infrastructure, security, and validation needed to make it reliable and safe."

That's from Google's "Prototype to Production" paper (2025), and it matches everything we've observed. The hard part of agent deployment isn't building the agent. It's everything around the agent.

This newsletter covers the patterns that survive contact with production—the infrastructure decisions that separate demos from systems that work.


The Production Gap

Most agent projects die in the gap between "works on my laptop" and "works in production." The demos are impressive. The pilots succeed. Then deployment fails because teams underestimate what production requires.

Common failure modes:

No rollback capability: Agent behavior degrades, but there's no way to revert to a known-good version quickly. Teams scramble to debug while users suffer.

Insufficient observability: Something goes wrong, but no one can diagnose why. Traces don't capture the full reasoning chain. Logs don't include the right context.

Evaluation stops at launch: The agent tested well before deployment. Six weeks later, it's drifting, but no one notices because continuous evaluation doesn't exist.

Security as afterthought: The agent gains tool access in development. Security review happens at launch—or doesn't happen at all.

Google's 80% estimate is instructive: plan for production infrastructure to consume most of your effort.


Pattern 1: Evaluation-Gated Deployment

The most important production pattern is evaluation-gated CI/CD. No agent version reaches users without passing quality gates.

How It Works

  1. Pre-merge evaluation: Every pull request triggers the evaluation suite. Changes that degrade performance don't merge.
  2. Golden dataset maintenance: A curated set of test cases that represent real usage. The dataset evolves as you discover new failure modes.
  3. Automated scoring: LLM-as-judge evaluations with consistent rubrics. Metrics for task completion, tool accuracy, safety compliance.
  4. Block on regression: If key metrics fall below thresholds, deployment stops automatically. No human judgment required for clear failures (see the sketch after this list).
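
A minimal sketch of such a gate as a CI script, assuming a JSONL golden dataset and exiting nonzero so the pipeline blocks the merge. The metric names, thresholds, and the `run_agent`/`score_with_judge` helpers are placeholders for your own harness, not a real API.

```python
"""Pre-merge evaluation gate: run the golden dataset, score, block on regression."""
import json
import sys

# Illustrative thresholds; tune against your own baseline.
THRESHOLDS = {"task_completion": 0.85, "tool_accuracy": 0.90, "safety_compliance": 1.00}

def run_agent(case_input: str) -> str:
    """Placeholder: invoke the agent version under test."""
    raise NotImplementedError

def score_with_judge(case: dict, output: str) -> dict:
    """Placeholder: LLM-as-judge scoring with a fixed rubric; returns 0-1 per metric."""
    raise NotImplementedError

def main() -> None:
    with open("golden_dataset.jsonl") as f:
        cases = [json.loads(line) for line in f]

    totals = {metric: 0.0 for metric in THRESHOLDS}
    for case in cases:
        scores = score_with_judge(case, run_agent(case["input"]))
        for metric in THRESHOLDS:
            totals[metric] += scores[metric]

    failed = []
    for metric, threshold in THRESHOLDS.items():
        mean = totals[metric] / len(cases)
        print(f"{metric}: {mean:.3f} (threshold {threshold:.2f})")
        if mean < threshold:
            failed.append(metric)

    if failed:
        print(f"Blocking merge: regression on {', '.join(failed)}")
        sys.exit(1)  # nonzero exit fails the CI job, so the change cannot merge

if __name__ == "__main__":
    main()
```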

What We Learned

Our customer support evaluation used 50 conversations across 7 framework implementations. We caught the CrewAI routing bug (which would have degraded production accuracy by 45%) and the AutoGen tool selection issue (which would have degraded it by 66%).

Pre-merge evaluation pays for itself quickly. The alternative—discovering regressions in production—costs more in every dimension.


Pattern 2: Safe Rollout Strategies

Even with evaluation gates, production reveals issues that testing doesn't. Safe rollout strategies limit blast radius.

Canary Deployments

Route 1-5% of traffic to the new version. Monitor closely. If metrics hold, gradually increase. If they degrade, roll back immediately.

Key metrics to watch:

  • Task completion rate
  • Error rates by error type
  • Latency percentiles (p50, p95, p99)
  • Tool call success rates
  • Escalation rates

Our DevOps remediation agent used canary deployment with 5% initial traffic. We caught a latency regression (p95 went from 2.3s to 4.7s) before it affected most users.
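
A sketch of what the automated side of that check can look like: pull aggregated metrics for the canary slice, compare against fixed limits, and decide promote versus roll back. The metric names, limits, and `fetch_canary_metrics` are assumptions standing in for your metrics store.

```python
"""Automated canary health check: promote if within limits, otherwise roll back."""

# Illustrative limits; "min" means the metric must stay at or above, "max" at or below.
CANARY_LIMITS = {
    "task_completion_rate": ("min", 0.90),
    "error_rate": ("max", 0.02),
    "latency_p95_seconds": ("max", 3.0),
    "tool_call_success_rate": ("min", 0.95),
    "escalation_rate": ("max", 0.10),
}

def fetch_canary_metrics(window_minutes: int = 30) -> dict:
    """Placeholder: query your metrics store for the canary traffic slice."""
    raise NotImplementedError

def canary_healthy() -> bool:
    metrics = fetch_canary_metrics()
    for name, (direction, limit) in CANARY_LIMITS.items():
        value = metrics[name]
        ok = value >= limit if direction == "min" else value <= limit
        if not ok:
            print(f"Canary breach: {name}={value} (limit {direction} {limit})")
            return False
    return True

if __name__ == "__main__":
    if canary_healthy():
        print("Promote: widen canary traffic toward 100%")
    else:
        print("Roll back: route all traffic to the previous version")
```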

Feature Flags

Deploy the code but control activation dynamically. Enable new agent capabilities for internal users first, then beta testers, then general availability.

Feature flags also enable instant rollback without redeployment. When the `rm -rf /` incident occurred in our DevOps agent, a feature flag would have allowed immediate capability restriction.
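
A minimal sketch of capability flags for an agent, assuming the flag table lives in a service you can update without redeploying; the capability names and tiers are illustrative.

```python
"""Capability flags: each capability is enabled up to a rollout stage, off by default."""

ROLLOUT_ORDER = ["internal", "beta", "general"]  # audiences, from narrowest to widest

# Widest audience each capability is currently enabled for. In production this table
# lives in a flag service so it can be flipped instantly, without a redeploy.
FLAGS = {
    "shell_commands": "internal",     # risky capability, kept to internal users
    "ticket_escalation": "general",
}

def capability_enabled(capability: str, user_tier: str) -> bool:
    stage = FLAGS.get(capability)
    if stage is None:
        return False  # unknown capability: disabled by default
    return ROLLOUT_ORDER.index(user_tier) <= ROLLOUT_ORDER.index(stage)

# General users can escalate tickets, but cannot trigger shell commands.
assert capability_enabled("ticket_escalation", "general")
assert not capability_enabled("shell_commands", "general")
```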

Blue-Green Deployment

Maintain two identical production environments. Deploy to the inactive one, test, then switch traffic. If problems emerge, switch back instantly.

This pattern works well for agents with complex state management, where canary splits create consistency challenges.
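
A toy illustration of why the switch is instant: the router reads a single active-environment pointer, so cutover and rollback are both one atomic update. The state file and environment names are stand-ins for whatever your load balancer or service mesh exposes.

```python
"""Blue-green cutover as a single pointer flip; calling switch_traffic() again rolls back."""
import json
import os

STATE_FILE = "active_environment.json"

def active_environment() -> str:
    if not os.path.exists(STATE_FILE):
        with open(STATE_FILE, "w") as f:
            json.dump({"active": "blue"}, f)  # bootstrap: blue serves traffic first
    with open(STATE_FILE) as f:
        return json.load(f)["active"]

def switch_traffic() -> str:
    """Point the router at the idle environment and return the new active one."""
    new_active = "green" if active_environment() == "blue" else "blue"
    with open(STATE_FILE, "w") as f:
        json.dump({"active": new_active}, f)
    return new_active

if __name__ == "__main__":
    print(f"Now serving from: {switch_traffic()}")
```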


Pattern 3: Observability That Works

"Debug with OpenTelemetry Traces: Answering 'Why?'" is how Google frames agent observability. You need to reconstruct the agent's reasoning path when something goes wrong.

The Three Pillars

Traces: The narrative connecting individual operations. When a customer complaint escalates incorrectly, the trace shows: user message → intent classification → tool selection → escalation decision → human handoff. Each step is visible.

Logs: The granular record of what happened. Every tool call, every model response, every decision point. Structured logging makes search possible.

Metrics: Aggregated indicators of system health. Task completion rates, latency distributions, error frequencies, cost per request.

What to Capture

From our implementations, the critical trace data includes:

  • Full prompts sent to models (anonymized as needed)
  • Model responses including reasoning chains
  • Tool calls with parameters and results
  • Decision points and which branch was taken
  • Timing for each component
  • Token counts and costs

When our legal extraction agent started hallucinating after 3 documents, traces revealed the pattern: context window was filling with previous results, pushing out source document content.
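
A sketch of instrumenting a single tool call with the OpenTelemetry Python SDK (the `opentelemetry-sdk` package), exporting spans to stdout for simplicity. The attribute names follow the list above but are our own convention, and the tool itself is a stub.

```python
"""One agent step as an OpenTelemetry span; the span records timing automatically."""
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout; in production, point this at your tracing backend instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def lookup_order(order_id: str) -> dict:
    """Stub tool so the example runs end to end."""
    return {"status": "ok", "items": 2}

with tracer.start_as_current_span("agent.tool_call") as span:
    # Capture enough to answer "why?" later: inputs, outputs, and cost.
    span.set_attribute("agent.tool.name", "lookup_order")
    span.set_attribute("agent.tool.params", '{"order_id": "A-123"}')  # anonymize as needed
    result = lookup_order("A-123")
    span.set_attribute("agent.tool.result.status", result["status"])
    span.set_attribute("agent.tokens.prompt", 812)        # illustrative token counts
    span.set_attribute("agent.tokens.completion", 96)
```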


Pattern 4: The Warden for Security

Security for agents requires architectural enforcement, not just policy. We covered the Warden pattern in detail in Newsletter #8, but the production implications deserve emphasis.

Defense in Depth

Layer 1: Input filtering. Block malicious prompts before they reach the agent. Classifiers for prompt injection, content policy violations.

Layer 2: The Warden. Validate tool calls before execution. Allowlist permitted operations. Reject by default.

Layer 3: Output filtering. Scan responses for sensitive data, policy violations, safety issues before they reach users.

Layer 4: Human-in-the-loop. High-stakes decisions pause for human approval (see Newsletter #12).
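
A minimal sketch of the Layer 2 check, assuming tool calls arrive as structured objects before execution; the tool names and argument validators are illustrative. The important property is the default: anything not explicitly allowlisted is rejected.

```python
"""Warden-style tool-call gate: allowlist with per-tool validation, reject by default."""
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

# Only explicitly approved operations may run; adding an entry requires human review.
ALLOWLIST = {
    "read_logs": lambda args: True,
    "restart_service": lambda args: args.get("service") in {"api", "worker"},
}

def warden_check(call: ToolCall) -> bool:
    validator = ALLOWLIST.get(call.name)
    if validator is None:
        return False                    # unknown operation: rejected by default
    return bool(validator(call.args))   # known operation: validate its arguments

# A destructive operation is blocked simply because it was never allowlisted.
assert warden_check(ToolCall("restart_service", {"service": "api"}))
assert not warden_check(ToolCall("delete_volume", {"id": "vol-1"}))
```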

Our Results

The Warden achieved 100% attack blocking with 0% false positives across 25 attack scenarios and 141 legitimate operations. That performance held in production because we maintained the allowlist conservatively—new operations required explicit approval.


Pattern 5: Evolution from Production

Google's whitepaper describes the "Observe → Act → Evolve" cycle. Production data improves the agent continuously.

The Feedback Loop

  1. Capture failures: Every escalation, error, and user complaint becomes a potential test case.
  2. Expand the golden dataset: Production failures that reveal gaps in evaluation become permanent test cases (see the sketch after this list).
  3. Refine and redeploy: Improvements pass through the evaluation gate and deploy through safe rollout.
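
A sketch of the dataset-expansion step, assuming failures arrive as structured records from your tracing or ticketing system; the field names and file layout are illustrative.

```python
"""Fold a production failure back into the golden dataset used by the Pattern 1 gate."""
import json
from datetime import date

def failure_to_case(failure: dict) -> dict:
    """Turn a logged failure (escalation, error, complaint) into a permanent test case."""
    return {
        "id": f"prod-{failure['trace_id']}",
        "input": failure["user_message"],
        "expected_behavior": failure["corrected_outcome"],  # what the agent should have done
        "tags": ["production_failure", failure["failure_mode"]],
        "added": date.today().isoformat(),
    }

def append_to_golden_dataset(failure: dict, path: str = "golden_dataset.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(failure_to_case(failure)) + "\n")

# Once added, the case runs in every pre-merge evaluation, so the same failure cannot ship twice.
```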

Practical Implementation

Our customer triage agent started with 50 evaluation samples. After three months in production, the golden dataset grew to 180 samples—each addition representing a failure mode we discovered and addressed.

The loop compounds: better evaluation catches more issues pre-deployment, which means fewer production failures, which means the failures that do occur are more informative.


The Investment

These patterns require upfront investment:

  • CI/CD infrastructure: Build pipelines, evaluation harnesses, deployment automation
  • Observability stack: Tracing, logging, metrics, dashboards, alerts
  • Golden dataset curation: Ongoing maintenance as you discover new failure modes
  • Security architecture: Warden implementation, input/output filtering, access control

Google's 80% estimate accounts for this. Teams that expect "agent intelligence" to be 80% of the work are surprised when infrastructure dominates.


The Takeaway

Production-grade agents aren't agents that work better. They're agents wrapped in infrastructure that makes them work reliably.

The intelligence is the easy part—pick a reasonable model, write decent prompts, connect appropriate tools. The hard part is everything else: evaluation gates, safe rollouts, comprehensive observability, architectural security, continuous evolution.

The patterns that survive production are the ones that assume production will try to break things. Build for that assumption.


Deploying agents to production? Reply with the patterns that work for you—we're building a collection.


References

  • Google. (2025). "Prototype to Production." Google AI Whitepaper Series.
  • Applied AI. (2025). Enterprise Agents Benchmark. Production deployment patterns across 4 agent implementations.