Enterprise Agents

MIT License · View on GitHub

Four experiments exploring different automation scenarios—from keyword matching to contract analysis to system remediation. Each one asks: should we use AI here?

The Agent Complexity Spectrum

Start simple. Add complexity only when the simpler approach fails.

  • Tier 0 · Rules & Heuristics: deterministic logic, no AI. Start here.
  • Tier 1 · Single LLM Call: prompt engineering, for when rules can't flex.
  • Tier 2 · Tool-Using Agent: an LLM plus external tools, for when you need real-time data.
  • Tier 3 · Multi-Agent: multiple specialists, for when Tier 2 hits its ceiling.

← Simpler · More Reliable · Cheaper | More Powerful · More Flexible · Costlier →

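As a concrete illustration of "start simple, escalate only when the simpler approach fails", the sketch below routes each request through Tier 0 rules first and only pays for a Tier 1 LLM call when no rule matches. The keyword table and the call_llm stub are illustrative placeholders, not the code used in these experiments.

```python
from typing import Callable, Optional

# Illustrative Tier 0 rules: keyword -> label. Returns None when no rule fires.
RULES = {"refund": "billing", "invoice": "billing", "password": "account_access"}

def classify_with_rules(text: str) -> Optional[str]:
    lowered = text.lower()
    for keyword, label in RULES.items():
        if keyword in lowered:
            return label
    return None

def classify(text: str, call_llm: Callable[[str], str]) -> str:
    """Tier 0 first; escalate to a single Tier 1 LLM call only when rules can't decide."""
    label = classify_with_rules(text)
    if label is not None:
        return label        # zero-cost path: most requests should end here
    return call_llm(text)   # paid path: only the requests rules couldn't handle

if __name__ == "__main__":
    stub_llm = lambda text: "general_inquiry"  # swap in your provider's client here
    print(classify("Where is my invoice?", stub_llm))        # handled by rules
    print(classify("My package arrived damaged", stub_llm))  # escalated to the LLM
```

The same shape extends upward: a Tier 2 agent would replace the single call with a tool-using loop, and Tier 3 would fan out to multiple specialists, but the escalation test stays the same.
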
We built four agents at different complexity levels and measured what happened. Each experiment tests a real automation scenario with quantified results: cost per request, accuracy, latency, and failure modes. The goal isn't to prove agents are good or bad—it's to help you decide whether automation makes sense for your specific use case, and if so, what level of sophistication is actually warranted.
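To make tiers comparable, every request can be logged with the same handful of fields and rolled up afterward. The record below is a generic sketch of that shape; the field names are illustrative, not the repo's actual schema.

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    """One row per request, so tiers can be compared on the same axes."""
    tier: int            # 0-3: which level of the spectrum handled the request
    correct: bool        # did the output match the labeled answer?
    latency_s: float     # wall-clock time for the request
    cost_usd: float      # 0 for rules; token cost for LLM tiers
    failure_mode: str    # e.g. "timeout" or "refused"; empty string if none

def summarize(rows: list[RequestMetrics]) -> dict:
    """Roll per-request rows up into headline numbers: accuracy, latency, total cost."""
    n = len(rows)
    return {
        "accuracy": sum(r.correct for r in rows) / n,
        "avg_latency_s": sum(r.latency_s for r in rows) / n,
        "total_cost_usd": sum(r.cost_usd for r in rows),
    }

rows = [
    RequestMetrics(tier=0, correct=True, latency_s=0.002, cost_usd=0.0, failure_mode=""),
    RequestMetrics(tier=1, correct=True, latency_s=1.3, cost_usd=0.0001, failure_mode=""),
]
print(summarize(rows))
```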

Key Findings

  • Rules alone hit 69% accuracy at zero cost. Adding AI boosts that to 88%, for under a penny per 100 requests.
  • Framework choice matters less than implementation quality. Fixing bugs improved CrewAI by 45%.
  • Bigger models aren't always better. GPT-4o-mini beat GPT-4o on legal extraction—at 10x lower cost.
  • Agents will try dangerous things. Within 5 seconds, ours attempted rm -rf /. The Warden pattern stopped it (see the sketch after this list).
  • Automate the 77%, use AI for edge cases. One scenario saves $80K/year with this hybrid approach.
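
The "Warden" mentioned above is a policy layer that sits between the agent and the shell: every command the agent proposes is screened before anything executes. The sketch below shows the general shape of such a guard, using a small, illustrative deny-list; the repository's actual implementation may differ.

```python
import re
import shlex
import subprocess

# Illustrative deny rules: patterns an agent-proposed command must never match.
DENY_PATTERNS = [
    r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b",  # rm -rf and rm -fr
    r"\bmkfs\b",
    r"\bdd\s+if=",
    r">\s*/dev/sd",
]

class CommandBlocked(Exception):
    """Raised when the warden refuses to run an agent-proposed command."""

def warden_run(command: str, timeout: int = 30) -> str:
    """Screen an agent-proposed shell command; run it only if no deny rule matches."""
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command):
            raise CommandBlocked(f"blocked by warden: {command!r} matched {pattern!r}")
    # Run without a shell so the agent can't smuggle in pipes, &&, or redirects.
    result = subprocess.run(shlex.split(command), capture_output=True, text=True, timeout=timeout)
    return result.stdout

# The safe command runs; the destructive one is stopped before it executes.
print(warden_run("echo disk usage check"))
try:
    warden_run("rm -rf /")
except CommandBlocked as exc:
    print(exc)
```

A production guard would more likely use an allowlist of known-safe commands plus human confirmation for anything else; the point here is only that the check happens before execution, not after.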

Questions about this project? Open an issue on GitHub or contact us directly.