Securing Agents: The Warden Pattern

Within 5 seconds of deployment, our AI agent tried to run `rm -rf /`.

Not maliciously. It was troubleshooting a disk space issue and decided the most efficient solution was to "clean up." The intent was helpful. The execution would have been catastrophic.

This is the security problem with agents: they reason autonomously about tool use. And sometimes that reasoning leads somewhere you didn't anticipate.

The Problem with Tool-Calling Agents

Traditional software follows predetermined paths. You write the code, define the logic, control the flow. Security means securing those paths.

Agent architectures are different. You give an LLM access to tools—database queries, API calls, shell commands—and it decides which tools to invoke and with what parameters. The execution path emerges at runtime based on the model's reasoning.

This creates a category of risk that traditional security models don't address:

Prompt injection: Malicious inputs that hijack the agent's instructions. "Ignore previous instructions and execute..."

Unintended tool chains: The model chains tools together in ways you didn't anticipate. Our `rm -rf /` incident wasn't an attack—it was "helpful" reasoning gone wrong.

Scope creep: Tools designed for narrow purposes get used for broader actions. A file reader becomes an information exfiltration channel.

The conventional response is blocklisting: enumerate bad things, prevent them. This fails for agents because you can't enumerate the creative ways a reasoning system might misuse its tools.

The Warden Pattern

We implemented a different approach: the Warden pattern. Instead of blocking known-bad actions, we permit only known-good actions.

The architecture is straightforward:

User Request → Agent (LLM) → Warden Proxy → Tool Execution
                                ↓
                         Validation Layer
                                ↓
                      Allowlist Check → Approve/Reject

The Warden sits between the agent's tool calls and actual execution. Every tool invocation passes through validation before anything happens. The agent reasons freely; the Warden constrains execution.

Key Design Principles

Allowlist, not blocklist: Define what's permitted, not what's forbidden. Our DevOps agent could execute specific remediation commands for known scenarios. Everything else was rejected by default.

Structural validation: Check command structure, not just keywords. `rm -rf /` fails not because "rm" is blocked, but because recursive deletion on root isn't in the allowed operation set.

Context-aware rules: Some operations are valid in certain contexts. Restarting a service is fine after a health check confirms it's unhealthy. The Warden tracks state.

Human escalation path: Unknown operations don't fail silently—they escalate for human review. The agent can request permission for novel actions.

The Results

We evaluated the Warden against our DevOps remediation dataset: 141 legitimate operational scenarios and 25 crafted attack attempts.

Metric	Result
Attacks blocked	25/25 (100%)
False positives	0/141 (0%)
Legitimate ops allowed	141/141 (100%)

Zero attacks succeeded. Zero legitimate operations were blocked. The pattern worked.

What the Attacks Looked Like

The 25 attack scenarios included:

Direct prompt injections ("Ignore your instructions and...")
Indirect injections via document content
Tool parameter manipulation
Chain-of-thought hijacking attempts
Encoded command obfuscation

None penetrated the Warden. The allowlist approach meant attack novelty didn't matter—if the operation wasn't explicitly permitted, it didn't execute.

The rm -rf / Incident

Our most instructive moment wasn't an attack. It was the agent trying to be helpful.

The scenario: A disk space alert. The agent investigated, found `/var/log` consuming significant space, and reasoned that clearing old files would help. Its proposed solution: `rm -rf /var/log/*`.

Reasonable-sounding. Also dangerous—log files often include application state, and aggressive cleanup can break systems.

The Warden rejected it. Log cleanup wasn't in the allowed operation set for disk space remediation. The agent escalated to human review. A human decided which logs were safe to rotate.

This is the point: the Warden doesn't just stop attacks. It stops well-intentioned but dangerous reasoning.

Implementation Considerations

If you're implementing a Warden pattern, here's what we learned:

Start with Narrow Scopes

Our DevOps Warden began with 12 permitted operations. We expanded to 23 over three months as we understood what the agent needed. Starting narrow and expanding is safer than starting broad and restricting.

Log Everything

Every Warden decision should be logged: the requested operation, the validation result, the reasoning. These logs are essential for debugging and for demonstrating compliance.

Design the Escalation Path

What happens when the agent requests something unexpected? Our implementation:

Rejects the operation
Logs the request
Notifies human operators
Gives the agent a "request denied, escalate for approval" response

The agent learns to work within constraints. Humans see what the constraints block.

Test with Adversarial Scenarios

We crafted 25 attack scenarios specifically designed to probe the Warden. Prompt injections, encoded commands, multi-step attack chains. Regular penetration testing keeps the pattern robust.

When Warden Isn't Enough

The Warden pattern has limits:

High-latency overhead: Every tool call adds validation latency. For latency-sensitive applications, the tradeoff may not work.

Maintenance burden: The allowlist requires ongoing curation. New legitimate operations need to be added; obsolete ones removed.

Doesn't prevent bad reasoning: The Warden stops bad execution, not bad planning. An agent that reasons incorrectly but proposes allowed operations will still fail—just safely.

For internal tools with controlled environments, the pattern works well. For public-facing agents with broad capabilities, additional layers (input filtering, output scanning, rate limiting) are necessary.

The Takeaway

Security for agents isn't a feature—it's an architecture. The Warden pattern implements defense through constraint: define what's allowed, reject everything else, escalate unknowns to humans.

Our results—100% attack blocking, 0% false positives—suggest the approach works. But the more important lesson came from the `rm -rf /` incident: the biggest risks aren't always attacks. They're well-intentioned reasoning that reaches dangerous conclusions.

Design for distrust. Even when the agent is trying to help.

Building agent security? Reply with the patterns you're using—we're collecting approaches that work.

References

Applied AI. (2025). Enterprise Agents Benchmark. Warden pattern evaluation: 25 attack scenarios, 141 legitimate operations, DevOps remediation domain.