Issue #7 · September 18, 2025

When Agents Are Overkill

Keyword routing handled 69% of our customer triage requests for $0.

No LLM. No agent framework. No multi-step reasoning. A lookup table checked for keywords like "refund," "shipping," and "cancel," then routed messages to the right department. More than two-thirds of inbound support requests were resolved by something that could run on a 1990s server.
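
The mechanism is as plain as it sounds. A minimal sketch, with a hypothetical keyword table standing in for the real one:

    # Hypothetical keyword-to-department table; the production version is
    # larger and tuned against observed support traffic.
    KEYWORD_ROUTES = {
        "refund": "billing",
        "charge": "billing",
        "cancel": "orders",
        "shipping": "shipping",
        "package": "shipping",
        "password": "accounts",
    }

    def route_by_keyword(message: str) -> str | None:
        """Return a department if any keyword matches, else None (no clear signal)."""
        text = message.lower()
        for keyword, department in KEYWORD_ROUTES.items():
            if keyword in text:
                return department
        return None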

This wasn't a failure. It was the most successful part of our system.


The Agent Reflex

Walk into any AI project kickoff and you'll hear the same conversation. "We need an agent for that." Customer support? Agent. Document processing? Agent. Internal routing? Multi-agent system with tool orchestration.

The logic seems reasonable. Large language models can reason. They can use tools. They can adapt to ambiguous inputs. Why wouldn't you want an autonomous system handling complex tasks?

Because most tasks aren't complex.

The agent hype has created a blind spot. Teams reach for sophisticated orchestration frameworks before asking whether a decision tree would suffice. They implement multi-step reasoning chains for problems that a single API call handles. They build elaborate architectures when pattern matching would do the job.

We built a customer triage system specifically to test this assumption. The results were instructive.


The Experiment

We used the Bitext 26K dataset—26,000 real customer support messages requiring routing to departments like orders, billing, accounts, and shipping. Our evaluation used 250 samples across six different implementations, ranging from simple keyword matching to multi-agent frameworks.
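
The harness itself is nothing exotic: run each implementation over the same labeled samples and count exact matches. A sketch of the shape, assuming labeled (message, department) pairs:

    def evaluate(route, samples):
        """Accuracy of a routing function over labeled (message, department) pairs."""
        correct = sum(route(message) == label for message, label in samples)
        return correct / len(samples)

    # Hypothetical usage, with `labeled` as a list of (message, department)
    # tuples drawn from the dataset:
    #   evaluate(route_by_keyword, labeled[:250])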

The Results

Implementation      Accuracy   Cost per Request
Keywords only       68.6%      $0
Single LLM call     87.6%      $0.000074
DSPy (optimized)    84.8%      $0.000081
PydanticAI          86.4%      $0.000079
LangGraph           84.4%      $0.000089
CrewAI              88.4%      $0.000171

The first row is the one that matters most.

Keyword routing—the simplest possible approach—handled 69% of requests correctly at zero cost. No model inference, no API calls, no token consumption. For high-volume systems processing thousands of requests daily, that baseline changes everything about unit economics.


What the Numbers Mean

The 69% baseline: Nearly seven in ten customer messages contain clear signals. "I want a refund" goes to billing. "Where's my package" goes to shipping. These don't need reasoning—they need routing. The obvious cases should be handled deterministically.

The 4-point ceiling: Across five different AI implementations, accuracy ranged from 84.4% to 88.4%. Four percentage points. We tested native code, three frameworks, and two prompt optimization approaches. Framework choice mattered less than expected.

The 2.3x cost gap: CrewAI achieved the highest accuracy (88.4%) at 2.3x the cost of native implementations. Whether that trade-off makes sense depends on context. If a misroute costs $50 in agent time, the extra accuracy might justify the expense. For low-stakes routing, it won't.
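
One way to sanity-check that trade-off is to compare expected cost per request: API spend plus the expected cost of cleaning up a misroute. A back-of-the-envelope sketch using the table's figures and an assumed $50 misroute cost:

    # Accuracy and per-request cost from the table; the misroute costs are assumptions.
    NATIVE = {"accuracy": 0.876, "cost": 0.000074}   # single LLM call
    CREWAI = {"accuracy": 0.884, "cost": 0.000171}

    def expected_cost(impl, misroute_cost):
        """API spend plus the expected cost of handling a misrouted message."""
        return impl["cost"] + (1 - impl["accuracy"]) * misroute_cost

    print(expected_cost(NATIVE, 50.0))  # ~6.20 per request in expectation
    print(expected_cost(CREWAI, 50.0))  # ~5.80: the pricier framework wins
    print(expected_cost(NATIVE, 0.01))  # ~0.0013
    print(expected_cost(CREWAI, 0.01))  # ~0.0013: low stakes, no payoff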

The hybrid math: In production, we'd recommend a tiered approach:

  • Tier 0 (keywords): Handle 69% at $0
  • Tier 1 (single LLM): Handle the ambiguous 31%
  • Tier 2 (agent escalation): True edge cases only

You're not paying for AI reasoning on messages where "refund" clearly maps to the billing queue.
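
A sketch of how the tiers compose, reusing route_by_keyword from the earlier snippet and taking the Tier 1 and Tier 2 handlers as parameters (both hypothetical; the 0.7 confidence cutoff is an assumption to tune against your own eval set):

    from typing import Callable, Tuple

    CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff, not a benchmarked value

    def triage(
        message: str,
        classify_with_llm: Callable[[str], Tuple[str, float]],
        escalate_to_agent: Callable[[str], str],
    ) -> str:
        """Route a message through the cheapest tier that can handle it."""
        # Tier 0: deterministic keyword match, free and instant.
        department = route_by_keyword(message)
        if department is not None:
            return department

        # Tier 1: single LLM classification for the ambiguous remainder.
        department, confidence = classify_with_llm(message)
        if confidence >= CONFIDENCE_THRESHOLD:
            return department

        # Tier 2: agent escalation for true edge cases only.
        return escalate_to_agent(message)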


The Sample Size Trap

One finding deserves special emphasis. Our pilot evaluation with 50 samples showed 94% accuracy. Our full 250-sample evaluation revealed 84-88% accuracy.

The gap is instructive. Small pilots systematically overestimate production performance. Edge cases appear at the margins of your distribution—you won't see them in 50 samples. Teams run a quick pilot, get excited about 94% accuracy, then discover reality is six to ten points lower when they scale.

Any serious evaluation needs sample sizes that capture the variance in your real distribution. Point estimates without confidence intervals are noise dressed as signal.
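
To see how much a 50-sample pilot can mislead, put an interval around both measurements. A sketch using a normal-approximation (Wald) interval, a simplification rather than anything the benchmark itself reports:

    import math

    def wald_interval(accuracy: float, n: int, z: float = 1.96):
        """95% normal-approximation confidence interval for an observed accuracy."""
        half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
        return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

    print(wald_interval(0.94, 50))    # roughly (0.87, 1.00): consistent with almost anything
    print(wald_interval(0.876, 250))  # roughly (0.84, 0.92): tight enough to plan around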


The Reliability Cliff

There's a mathematical argument for simpler architectures that teams often overlook.

If each step in an agent workflow succeeds 90% of the time:

  • 1 step: 90% success
  • 2 steps: 81% success
  • 3 steps: 73% success
  • 5 steps: 59% success

Hugo Bowne-Anderson put it directly: "85-90% accuracy per tool call. Four or five calls? It's a coin flip" (Vanishing Gradients, 2025).

This is the reliability cliff. A 90% per-step accuracy sounds acceptable until you realize a 5-step workflow fails 41% of the time. Multi-agent architectures multiply failure points. Simpler architectures have fewer places to break.
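
The compounding is just exponentiation of the per-step success rate; a quick check:

    per_step = 0.90
    for steps in (1, 2, 3, 5):
        success = per_step ** steps
        print(steps, round(success, 2), round(1 - success, 2))  # steps, success, failure
    # 5 steps: 0.59 success, 0.41 failure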

The math favors boring.


When You Actually Need Agents

None of this means agents are useless. They're useful for the right problems:

Genuine complexity: Tasks where the execution path cannot be predetermined. Our DevOps remediation system—diagnosing cluster failures and executing fixes—required multi-step reasoning because the "right" action depended on runtime observations.

External tool orchestration: When you need to combine reasoning with database lookups, API calls, and actions across systems. Our customer support agent used tools for order lookup, account status, and refund processing.

Adaptation under uncertainty: Scenarios where the agent must adjust based on intermediate results. Research tasks. Investigation workflows. Problems where you can't write a flowchart covering 80% of cases.

The pattern: agents earn their complexity when the task genuinely requires planning and adaptation. Not because "AI should handle this" but because simpler approaches demonstrably fail.


The Boring Solution Principle

Alex Strick van Linschoten, reflecting on 750+ production deployments, summarized it well: "If you can get away with not having something which is fully or semi-autonomous, then you really should and it'll be much easier to debug and evaluate and improve" (Vanishing Gradients, 2025).

The boring solution that works reliably beats the exciting solution that might work. Keyword routing isn't glamorous, but it:

  • Costs nothing at scale
  • Never hallucinates
  • Fails predictably
  • Debugs trivially
  • Runs in milliseconds

When it handles 69% of your volume, you've eliminated 69% of your AI complexity, cost, and failure modes. The remaining 31% can use the sophisticated approach—targeted where it actually adds value.


The Takeaway

Before reaching for agents, ask: what percentage of cases can simpler methods handle?

Build the baseline first. Measure it. Often you'll find that keyword matching, decision trees, or single API calls solve 60-70% of the problem. Agent architectures should handle the residual complexity—not replace systems that already work.

The teams building production AI that survives contact with users aren't the ones with the most sophisticated architectures. They're the ones who matched solution complexity to problem complexity.

Most problems need less than you think.


Have a question about tier selection for your use case? Reply to this email.


References

  • Applied AI. (2025). Enterprise Agents Benchmark. Customer triage evaluation, 250 samples across 6 implementations.
  • Bowne-Anderson, H. & Strick van Linschoten, A. (2025). "Practical Lessons from 750+ Real-World LLM and Agent Deployments." Vanishing Gradients Podcast.