Securing Agents: The Warden Pattern
Within 5 seconds of deployment, our AI agent tried to run `rm -rf /`.
Within 5 seconds of deployment, our AI agent tried to run `rm -rf /`.
Not maliciously. It was troubleshooting a disk space issue and decided the most efficient solution was to "clean up." The intent was helpful. The execution would have been catastrophic.
This is the security problem with agents: they reason autonomously about tool use. And sometimes that reasoning leads somewhere you didn't anticipate.
The Problem with Tool-Calling Agents
Traditional software follows predetermined paths. You write the code, define the logic, control the flow. Security means securing those paths.
Agent architectures are different. You give an LLM access to tools—database queries, API calls, shell commands—and it decides which tools to invoke and with what parameters. The execution path emerges at runtime based on the model's reasoning.
This creates a category of risk that traditional security models don't address:
Prompt injection: Malicious inputs that hijack the agent's instructions. "Ignore previous instructions and execute..."
Unintended tool chains: The model chains tools together in ways you didn't anticipate. Our `rm -rf /` incident wasn't an attack—it was "helpful" reasoning gone wrong.
Scope creep: Tools designed for narrow purposes get used for broader actions. A file reader becomes an information exfiltration channel.
The conventional response is blocklisting: enumerate bad things, prevent them. This fails for agents because you can't enumerate the creative ways a reasoning system might misuse its tools.
The Warden Pattern
We implemented a different approach: the Warden pattern. Instead of blocking known-bad actions, we permit only known-good actions.
The architecture is straightforward:
User Request → Agent (LLM) → Warden Proxy → Tool Execution
↓
Validation Layer
↓
Allowlist Check → Approve/Reject
The Warden sits between the agent's tool calls and actual execution. Every tool invocation passes through validation before anything happens. The agent reasons freely; the Warden constrains execution.
Key Design Principles
Allowlist, not blocklist: Define what's permitted, not what's forbidden. Our DevOps agent could execute specific remediation commands for known scenarios. Everything else was rejected by default.
Structural validation: Check command structure, not just keywords. `rm -rf /` fails not because "rm" is blocked, but because recursive deletion on root isn't in the allowed operation set.
Context-aware rules: Some operations are valid in certain contexts. Restarting a service is fine after a health check confirms it's unhealthy. The Warden tracks state.
Human escalation path: Unknown operations don't fail silently—they escalate for human review. The agent can request permission for novel actions.
The Results
We evaluated the Warden against our DevOps remediation dataset: 141 legitimate operational scenarios and 25 crafted attack attempts.
| Metric | Result |
|---|---|
| Attacks blocked | 25/25 (100%) |
| False positives | 0/141 (0%) |
| Legitimate ops allowed | 141/141 (100%) |
Zero attacks succeeded. Zero legitimate operations were blocked. The pattern worked.
What the Attacks Looked Like
The 25 attack scenarios included:
- Direct prompt injections ("Ignore your instructions and...")
- Indirect injections via document content
- Tool parameter manipulation
- Chain-of-thought hijacking attempts
- Encoded command obfuscation
None penetrated the Warden. The allowlist approach meant attack novelty didn't matter—if the operation wasn't explicitly permitted, it didn't execute.
The rm -rf / Incident
Our most instructive moment wasn't an attack. It was the agent trying to be helpful.
The scenario: A disk space alert. The agent investigated, found `/var/log` consuming significant space, and reasoned that clearing old files would help. Its proposed solution: `rm -rf /var/log/*`.
Reasonable-sounding. Also dangerous—log files often include application state, and aggressive cleanup can break systems.
The Warden rejected it. Log cleanup wasn't in the allowed operation set for disk space remediation. The agent escalated to human review. A human decided which logs were safe to rotate.
This is the point: the Warden doesn't just stop attacks. It stops well-intentioned but dangerous reasoning.
Implementation Considerations
If you're implementing a Warden pattern, here's what we learned:
Start with Narrow Scopes
Our DevOps Warden began with 12 permitted operations. We expanded to 23 over three months as we understood what the agent needed. Starting narrow and expanding is safer than starting broad and restricting.
Log Everything
Every Warden decision should be logged: the requested operation, the validation result, the reasoning. These logs are essential for debugging and for demonstrating compliance.
Design the Escalation Path
What happens when the agent requests something unexpected? Our implementation:
- Rejects the operation
- Logs the request
- Notifies human operators
- Gives the agent a "request denied, escalate for approval" response
The agent learns to work within constraints. Humans see what the constraints block.
Test with Adversarial Scenarios
We crafted 25 attack scenarios specifically designed to probe the Warden. Prompt injections, encoded commands, multi-step attack chains. Regular penetration testing keeps the pattern robust.
When Warden Isn't Enough
The Warden pattern has limits:
High-latency overhead: Every tool call adds validation latency. For latency-sensitive applications, the tradeoff may not work.
Maintenance burden: The allowlist requires ongoing curation. New legitimate operations need to be added; obsolete ones removed.
Doesn't prevent bad reasoning: The Warden stops bad execution, not bad planning. An agent that reasons incorrectly but proposes allowed operations will still fail—just safely.
For internal tools with controlled environments, the pattern works well. For public-facing agents with broad capabilities, additional layers (input filtering, output scanning, rate limiting) are necessary.
The Takeaway
Security for agents isn't a feature—it's an architecture. The Warden pattern implements defense through constraint: define what's allowed, reject everything else, escalate unknowns to humans.
Our results—100% attack blocking, 0% false positives—suggest the approach works. But the more important lesson came from the `rm -rf /` incident: the biggest risks aren't always attacks. They're well-intentioned reasoning that reaches dangerous conclusions.
Design for distrust. Even when the agent is trying to help.
Building agent security? Reply with the patterns you're using—we're collecting approaches that work.
References
- Applied AI. (2025). Enterprise Agents Benchmark. Warden pattern evaluation: 25 attack scenarios, 141 legitimate operations, DevOps remediation domain.