Issue #8 · October 02, 2025 · agents · 5 min read

The Tool Calling Problem

"85-90% accuracy per tool call. Four or five calls? It's a coin flip."

"85-90% accuracy per tool call. Four or five calls? It's a coin flip."

That's Hugo Bowne-Anderson, summarizing what he's observed across hundreds of production agent deployments (Vanishing Gradients, 2025). The numbers are jarring. State-of-the-art tool calling—the foundation of agent architectures—operates at 85-90% reliability per step.

Do the math on a 5-step workflow and you understand why most agent projects fail.


The Compounding Problem

Reliability compounds multiplicatively, not additively. If each step succeeds 90% of the time:

Steps   Success Rate
 1      90%
 2      81%
 3      73%
 4      66%
 5      59%
 7      48%
10      35%
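
The table is nothing more than per-step reliability raised to the number of steps. A minimal sketch in Python:

    # End-to-end success is per-step reliability raised to the number of steps.
    def workflow_success(per_step: float, steps: int) -> float:
        return per_step ** steps

    for steps in (1, 2, 3, 4, 5, 7, 10):
        print(f"{steps:>2} steps at 90% per step -> {workflow_success(0.90, steps):.0%}")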

90% per-step accuracy sounds acceptable. A 41% failure rate for a 5-step workflow does not.

This is the reliability cliff that kills agent deployments. Teams build workflows assuming each component is "pretty reliable" and discover in production that the system as a whole is a coin flip.


Where Failures Happen

In our customer support benchmark, we tracked where tool-calling agents broke down. The failure modes clustered into patterns:

Wrong Tool Selection (35% of failures)

The model has access to tools for order lookup, account status, refund processing, and escalation. It receives a message about a delayed shipment and calls the refund tool instead of order status.

The issue isn't that the model can't use tools—it's that tool selection requires understanding subtle context cues. "My order is taking forever" could be a status inquiry or a complaint requiring escalation. The model guesses. Sometimes it guesses wrong.

Wrong Parameters (25% of failures)

The model selects the correct tool but passes incorrect arguments. Order ID extraction goes wrong. Date formats don't match. Required fields are omitted or hallucinated.

Parameter extraction is a parsing problem, and parsing from natural language is inherently noisy. The model extracts what it thinks it sees, not necessarily what's there.
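
One mitigation that catches a share of these failures is validating extracted arguments against an explicit schema before the tool executes. A minimal sketch, with a hypothetical refund tool and a made-up ORD-XXXXXXXX order ID format:

    import re

    # Hypothetical argument schema for a refund tool; field names and the
    # order ID format are assumptions for illustration.
    def validate_refund_args(raw: dict) -> list[str]:
        errors = []
        order_id = raw.get("order_id", "")
        if not re.fullmatch(r"ORD-\d{8}", str(order_id)):
            errors.append(f"order_id {order_id!r} does not match ORD-XXXXXXXX")
        try:
            amount = float(raw["amount"])
            if amount <= 0:
                errors.append("amount must be positive")
        except (KeyError, TypeError, ValueError):
            errors.append("amount is missing or not a number")
        return errors

    def execute_refund(raw_args: dict) -> dict:
        problems = validate_refund_args(raw_args)
        if problems:
            # Reject the call instead of executing with bad parameters.
            return {"status": "rejected", "reasons": problems}
        # ... dispatch to the real refund API here ...
        return {"status": "ok"}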

Missing Tool Calls (20% of failures)

The model should call a tool but doesn't. It answers from "knowledge" instead of checking the database. It skips a required validation step. It assumes rather than verifies.

This failure mode is insidious because the response often sounds confident. The model doesn't signal uncertainty—it just proceeds with incomplete information.

Hallucinated Results (20% of failures)

The model invents tool outputs that didn't happen. It fabricates order statuses. It generates tracking numbers that don't exist. In our legal extraction benchmark, CrewAI produced contract clauses that weren't in the source documents.

When tool calling fails silently, hallucination fills the gap.


Why Prompting Doesn't Fix This

The obvious response is "improve the prompts." Add more examples. Clarify tool descriptions. Include explicit instructions about when to use each tool.

We tried this. After prompt engineering, our AutoGen implementation improved from 38% to 63% Tool F1, a 66% relative gain, but still below production-acceptable thresholds. Better prompting helped. It didn't solve the fundamental problem.

The reliability ceiling isn't prompt engineering. It's the architecture.

Tool calling asks a language model to do three things simultaneously:

  1. Understand the user request
  2. Select the appropriate tool from a set of options
  3. Extract parameters accurately from unstructured input

Each of these tasks has its own error rate. Combined, they compound. No amount of prompting eliminates the compounding math.
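
To make that concrete, assign each sub-task its own reliability (the numbers below are illustrative assumptions, not measurements) and multiply:

    # Illustrative per-sub-task reliabilities; the exact values are assumptions.
    understand = 0.97  # interpret the user request correctly
    select = 0.95      # choose the right tool
    extract = 0.95     # fill in the parameters correctly

    per_call = understand * select * extract
    print(f"per tool call:   {per_call:.0%}")       # ~88%, inside the 85-90% band
    print(f"5-call workflow: {per_call ** 5:.0%}")  # ~51%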


What Actually Works

Teams that successfully deploy tool-calling agents don't achieve 99% per-step reliability. They design architectures that tolerate 85-90% reliability.

Reduce Step Count

The most effective intervention is reducing the number of sequential tool calls. If you can accomplish the task in 2 steps instead of 5, your failure rate drops from 41% to 19%.

This often means:

  • Consolidating multiple lookups into single database queries
  • Preprocessing data to reduce runtime decisions
  • Moving logic from agent reasoning to deterministic code
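
A sketch of the first and third points together, using a hypothetical order/shipment schema: one consolidated query replaces three separate agent-driven lookups, and deterministic code, not the model, decides whether the order is late.

    import sqlite3
    from datetime import date

    # Hypothetical consolidated lookup: one tool backed by a single query replaces
    # three separate agent-driven calls (order lookup, shipment lookup, delay check).
    def order_overview(conn: sqlite3.Connection, order_id: str) -> dict:
        row = conn.execute(
            """
            SELECT o.status, o.promised_date, s.carrier, s.shipped_date
            FROM orders o
            LEFT JOIN shipments s ON s.order_id = o.id
            WHERE o.id = ?
            """,
            (order_id,),
        ).fetchone()
        if row is None:
            return {"found": False}
        status, promised, carrier, shipped = row
        # Deterministic code, not agent reasoning, decides whether the order is late.
        days_late = (date.today() - date.fromisoformat(promised)).days if promised else 0
        return {
            "found": True,
            "status": status,
            "carrier": carrier,
            "shipped_date": shipped,
            "days_late": max(days_late, 0),
        }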

Constrain Tool Sets

Hugo's observation aligns with what Alex Strick van Linschoten found across ZenML's deployment database: "The rule of thumb is just try and constrain [tool access] as much as possible" (Vanishing Gradients, 2025).

Our best-performing agents had 5-7 tools. When we expanded to 15+, reliability dropped measurably. Each additional tool increases the decision space and the opportunity for wrong selection.
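
One way to apply this per request rather than globally (tool and intent names here are illustrative): route each message to a small, intent-specific subset of tools instead of exposing the full catalog.

    # Hypothetical routing table: expose only the tools relevant to the detected
    # intent, instead of the full catalog, when building the model request.
    TOOLS_BY_INTENT = {
        "order_status": ["lookup_order", "lookup_shipment"],
        "refund": ["lookup_order", "process_refund", "escalate_to_human"],
        "account": ["lookup_account", "escalate_to_human"],
    }

    def tools_for_request(intent: str) -> list[str]:
        # Unknown or ambiguous intent falls back to the safest minimal set.
        return TOOLS_BY_INTENT.get(intent, ["escalate_to_human"])

    # The returned names are then mapped to full tool specs and passed to the
    # model's tool-calling API for this request only.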

Add Verification Loops

When a tool call matters, verify it. Our legal extraction pipeline compared extracted text against source documents before accepting results. Our DevOps agent validated command outputs before proceeding to the next step.

Verification adds latency but catches failures before they propagate. The trade-off is usually worth it.
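
A stripped-down sketch of that kind of check for the extraction case; exact-substring matching is a simplification for illustration, not the benchmark's actual comparison:

    def _normalize(text: str) -> str:
        return " ".join(text.split()).lower()

    def verify_extraction(extracted_clause: str, source_text: str) -> bool:
        # Accept an extracted clause only if it literally appears in the source.
        return _normalize(extracted_clause) in _normalize(source_text)

    def accept_or_flag(extracted_clause: str, source_text: str) -> dict:
        if verify_extraction(extracted_clause, source_text):
            return {"status": "accepted", "clause": extracted_clause}
        # Anything that fails verification goes to retry or human review instead
        # of passing a possibly hallucinated result downstream.
        return {"status": "needs_review", "clause": extracted_clause}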

Design for Graceful Degradation

What happens when a tool call fails? In our customer support agent, failed lookups triggered graceful fallbacks:

  • Order lookup failure → offer to connect to human agent
  • Account status failure → ask customer to verify account details
  • Escalation uncertainty → default to escalation (safe failure mode)

The system stayed functional even when individual steps failed.
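
In code, graceful degradation amounts to mapping each failure to an explicit fallback instead of letting the model improvise one. Tool names and messages below are placeholders, not the agent's actual configuration:

    # Illustrative fallbacks keyed by the tool that failed.
    FALLBACKS = {
        "lookup_order": "I couldn't retrieve that order. Want me to connect you with a human agent?",
        "account_status": "I couldn't verify your account. Could you confirm the email address on file?",
    }

    def handle_tool_failure(tool_name: str, uncertain: bool = False) -> dict:
        if uncertain:
            # When the agent isn't sure, default to escalation: the safe failure mode.
            return {"action": "escalate_to_human"}
        message = FALLBACKS.get(tool_name)
        if message is not None:
            return {"action": "reply", "message": message}
        return {"action": "escalate_to_human"}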


The Uncomfortable Truth

The tool calling reliability problem isn't going away soon. It's not a bug to be fixed—it's a characteristic of how language models interact with structured systems.

State-of-the-art is 85-90% per step. Compound that across multi-step workflows, and you understand why the most successful production agents:

  • Keep step counts low
  • Constrain tool access aggressively
  • Verify critical operations
  • Design for graceful failure

The teams building agents that work aren't achieving 99% reliability. They're designing systems that work at 85%.


The Takeaway

When evaluating agent architectures, ask: how many sequential tool calls does this require?

Every additional step is a reliability tax. A 5-step workflow at 90% per-step succeeds 59% of the time. A 3-step workflow succeeds 73% of the time. That 14-point difference isn't marginal—it's the difference between frustrating and functional.

The reliability problem is architectural, not a prompting problem. Design accordingly.


Dealing with tool calling reliability in production? Reply with what's working for you.


References

  • Bowne-Anderson, H. & Strick van Linschoten, A. (2025). "Practical Lessons from 750+ Real-World LLM and Agent Deployments." Vanishing Gradients Podcast.
  • Applied AI. (2025). Enterprise Agents Benchmark. Customer support tool calling evaluation, failure mode analysis.