Framework Shootout: What Actually Matters
We tested six agent frameworks and a native baseline on the same tasks. The accuracy spread was 4 percentage points. The improvement from fixing a single bug was 45%.
Framework choice mattered less than we expected. Implementation quality mattered more than anyone admits.
The Setup
The agent framework landscape is crowded. LangGraph. CrewAI. AutoGen. DSPy. PydanticAI. Agno. Or skip the framework entirely and go native. Every vendor claims advantages. Every benchmark shows different winners.
So we ran our own comparison: same tasks, same datasets, same evaluation criteria, same model (GPT-4o-mini). Seven implementations were tested on two tasks: a customer triage classification task (250 samples) and the BrownBox E-Commerce dataset, 50 multi-turn customer support conversations requiring tool calling for order lookups, account status checks, refund processing, and escalation decisions.
The question wasn't "which framework is best?" It was "how much does framework choice actually matter?"
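For reference, "Tool F1" in the tables below is an F1 score over the tool calls an agent makes versus the tool calls the conversation actually required. A minimal sketch of the idea (illustrative, not our exact scoring code; a fuller harness would also compare call arguments):

```python
from collections import Counter

def tool_f1(expected_calls: list[str], actual_calls: list[str]) -> float:
    """F1 over the multiset of tool names called in one conversation."""
    expected, actual = Counter(expected_calls), Counter(actual_calls)
    true_positives = sum((expected & actual).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(actual.values())
    recall = true_positives / sum(expected.values())
    return 2 * precision * recall / (precision + recall)

# Illustrative tool names: the agent looked up the order but skipped the refund tool.
print(tool_f1(["lookup_order", "process_refund"], ["lookup_order"]))  # ~0.67
```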
The Results
Customer Support (Multi-Turn, Tool Calling)
| Framework | Tool F1 | Escalation Accuracy | Notes |
|---|---|---|---|
| LangGraph | 0.75 | 92% | State machine approach |
| Native | 0.69 | 75% | Hand-coded ReAct loop |
| CrewAI | 0.65 | 75% | After routing fix |
| AutoGen | 0.63 | 89% | After prompt fix |
| Agno | 0.34 | 78% | Poor tool selection |
Customer Triage (Classification)
| Framework | Accuracy | Cost per Request |
|---|---|---|
| Native | 87.6% | $0.000074 |
| PydanticAI | 86.4% | $0.000079 |
| DSPy | 84.8% | $0.000081 |
| LangGraph | 84.4% | $0.000089 |
| CrewAI | 88.4% | $0.000171 |
Two different tasks. Same pattern: the spread between frameworks is narrow. The difference between best and worst in triage was 4 percentage points (84.4% to 88.4%). The difference in customer support Tool F1 was wider—but the story behind those numbers is more interesting.
The Real Finding: Bug Fixes > Framework Choice
Here's what the summary tables don't show.
CrewAI's journey: Initial implementation scored 0.45 Tool F1. After investigating, we found a routing bug—conversations were being misclassified before reaching the right agent. After the fix: 0.65 Tool F1. A 45% improvement from fixing one bug.
AutoGen's transformation: Initial implementation struggled with tool selection—the model kept calling wrong functions. After prompt engineering to clarify tool descriptions and add examples: 0.63 Tool F1 from 0.38. A 66% improvement from better prompts.
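To give a flavor of what "clarify tool descriptions and add examples" means in practice, here's an illustrative before/after on an OpenAI-style function schema. The tool name, fields, and wording are hypothetical, not the benchmark's actual prompts:

```python
# Illustrative only: tool name, fields, and wording are hypothetical.
vague_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Gets order info.",  # the model has to guess when this applies
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}

clarified_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": (
            "Fetch status, items, and shipping details for ONE existing order. "
            "Use for questions about delivery dates or order contents. "
            "Do NOT use for refunds (process_refund) or account questions (get_account_status). "
            "Example: 'where is my package?' -> call lookup_order with the customer's order ID."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Order identifier, e.g. 'BB-10482'. Ask the customer if unknown.",
                },
            },
            "required": ["order_id"],
        },
    },
}
```

The structural change is small; the behavioral change (which tool the model picks, and when) is not.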
Set Agno aside and the spread between frameworks (0.63 to 0.75 Tool F1) was smaller than the gains those fixes delivered (+0.20 and +0.25 Tool F1). In other words: a well-implemented mediocre framework outperforms a poorly implemented excellent one.
Framework Overhead Is Real
Token consumption tells a different story. We measured equivalent tasks across implementations:
| Implementation | Tokens Used | Overhead (share of total tokens) |
|---|---|---|
| Native | 28,529 | Baseline |
| LangGraph | 54,060 | 47% |
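If you want to reproduce this kind of measurement, token counts come straight back from the API. A sketch assuming the OpenAI Python client (not our exact instrumentation; frameworks typically expose usage through callbacks or a custom client):

```python
from openai import OpenAI

client = OpenAI()
total_tokens = 0

def tracked_completion(**kwargs):
    """Accumulate the token usage reported with each chat completion."""
    global total_tokens
    response = client.chat.completions.create(**kwargs)
    total_tokens += response.usage.total_tokens
    return response

# Route every model call through tracked_completion, run the benchmark once
# per implementation, then compare total_tokens across implementations.
```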
LangGraph consumed nearly twice as many tokens as the native implementation for the same work: roughly 47% of its usage was framework overhead rather than task content. Where does the overhead come from?
- State management boilerplate: Serializing and deserializing conversation state
- Framework prompts: System instructions that the framework injects
- Logging and tracing: Built-in observability (useful, but not free)
At scale, this compounds. If you're processing 1 million requests monthly, 47% token overhead translates directly to cost. Framework abstractions trade development convenience for runtime efficiency.
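A back-of-the-envelope version of that math, where "overhead" means the share of a request's tokens the framework adds on top of the task itself. The per-request token count and GPT-4o-mini price below are assumptions for illustration, not measurements:

```python
# All inputs are illustrative assumptions, not measured values.
requests_per_month = 1_000_000
native_tokens_per_request = 1_200      # assumed native baseline per request
overhead_share = 0.47                  # framework overhead as a share of total tokens
usd_per_million_tokens = 0.15          # assumed GPT-4o-mini input pricing

# If 47% of a framework run's tokens are overhead, total usage is
# native / (1 - 0.47), i.e. roughly 1.9x the native baseline.
framework_tokens_per_request = native_tokens_per_request / (1 - overhead_share)

def monthly_cost(tokens_per_request: float) -> float:
    return tokens_per_request * requests_per_month / 1_000_000 * usd_per_million_tokens

print(f"native:    ${monthly_cost(native_tokens_per_request):,.2f}/month")
print(f"framework: ${monthly_cost(framework_tokens_per_request):,.2f}/month")
```

Under these assumptions the overhead roughly doubles model spend; the absolute dollars scale with your real per-request token counts.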
This doesn't mean frameworks are wrong—structured state management and built-in observability have real value. But the trade-off should be conscious, not accidental.
The Agno Caveat
Agno deserves special mention because it illustrates a common trap.
Agno claims 529x faster instantiation than LangGraph. Our measurements showed 242x faster instantiation and 35x lower memory usage. Impressive numbers—exactly the kind of benchmarks that get attention on Hacker News.
But Agno's Tool F1 of 0.34 was the worst in our evaluation. It couldn't reliably select the right tools for the task.
Speed metrics and task performance are orthogonal. Instantiation time doesn't matter if the agent can't complete the task. Memory efficiency is irrelevant if accuracy is unacceptable.
When evaluating frameworks, measure what matters for your use case. For most applications, that's task completion quality, not startup latency.
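Both dimensions are cheap to measure yourself. A minimal sketch, with hypothetical `build_agent`/`run_agent` placeholders standing in for whichever framework you're evaluating:

```python
import time
import tracemalloc

def benchmark(build_agent, run_agent, eval_cases):
    """Report startup cost and task quality for one implementation.

    build_agent() constructs the agent; run_agent(agent, x) returns its answer.
    Both are placeholders for your framework of choice; eval_cases is a list
    of (input, expected_output) pairs.
    """
    # Startup cost: wall time and peak memory to construct the agent.
    tracemalloc.start()
    start = time.perf_counter()
    agent = build_agent()
    startup_seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Task quality: the number that decides whether the agent is usable at all.
    correct = sum(run_agent(agent, x) == expected for x, expected in eval_cases)

    return {
        "startup_s": startup_seconds,
        "peak_startup_mem_mb": peak_bytes / 1e6,
        "accuracy": correct / len(eval_cases),
    }
```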
Cost Matters, But Not How You'd Expect
The cost spread across implementations was notable:
- Native: $0.000074 per request
- CrewAI: $0.000171 per request
CrewAI cost 2.3x as much as the native implementation for the same task. But CrewAI also achieved the highest accuracy (88.4% vs 87.6%).
Is 0.8 percentage points worth 2.3x cost? It depends entirely on context:
High-stakes routing: If a misroute costs $50 in agent time, and you process 100,000 requests monthly, the cost difference ($17 vs $7 per 100K) is trivial compared to the error cost savings.
Low-stakes notifications: If misroutes cause minor inconvenience, the cheaper option makes sense.
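Working through the high-stakes scenario with our triage numbers (the $50 misroute cost is the illustrative figure from above, not a measurement):

```python
requests_per_month = 100_000
cost_per_misroute = 50.00   # illustrative figure from the scenario above

native = {"cost_per_request": 0.000074, "accuracy": 0.876}
crewai = {"cost_per_request": 0.000171, "accuracy": 0.884}

def monthly_total(option: dict) -> float:
    api_cost = option["cost_per_request"] * requests_per_month
    error_cost = (1 - option["accuracy"]) * requests_per_month * cost_per_misroute
    return api_cost + error_cost

print(f"native: ${monthly_total(native):,.0f}/month")  # ~$620,007
print(f"CrewAI: ${monthly_total(crewai):,.0f}/month")  # ~$580,017
```

The extra ~$10 of monthly API spend buys roughly $40,000 a month in avoided misroutes. In the low-stakes case the error term shrinks toward zero and the cheaper option wins.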
The framework isn't the decision—the business context is.
What We'd Actually Recommend
After running these comparisons, here's our practical guidance:
Pick Based on Your Constraints
Choose LangGraph if: You need explicit state management, want production observability built-in, and can afford the token overhead. It won on our customer support task.
Choose Native if: You're optimizing for cost and latency, have engineering capacity to build ReAct loops, and don't need framework abstractions. Best value in our triage task.
Choose CrewAI if: You want multi-agent coordination patterns, are willing to pay for convenience, and have use cases that benefit from role-based agent design. Highest accuracy in triage.
Avoid framework churn: Switching frameworks mid-project rarely pays off. The implementation knowledge you've built matters more than marginal framework advantages.
Then Invest in Implementation Quality
Whatever you choose, allocate time for:
Prompt engineering: Our AutoGen improvement (66%) came entirely from better prompts. Tool descriptions, system instructions, and few-shot examples matter more than framework features.
Bug hunting: Our CrewAI improvement (45%) came from fixing a routing bug. Systematic debugging beats framework switching.
Evaluation infrastructure: You can't improve what you don't measure. Build evaluation pipelines before optimizing frameworks.
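An evaluation pipeline doesn't need to be elaborate to be useful. A minimal sketch, reusing the `tool_f1` helper from earlier; `run_agent` and the case format are placeholders for your own system:

```python
import json
from statistics import mean

def evaluate(run_agent, cases_path: str) -> dict:
    """Run an agent over a frozen set of labeled cases and report metrics.

    Each line of the cases file is assumed to be a JSON object like:
      {"conversation": [...], "expected_tools": [...], "expected_escalation": true}
    run_agent(conversation) should return (tools_called, escalated).
    """
    with open(cases_path) as f:
        cases = [json.loads(line) for line in f]

    tool_f1s, escalation_hits = [], []
    for case in cases:
        tools_called, escalated = run_agent(case["conversation"])
        tool_f1s.append(tool_f1(case["expected_tools"], tools_called))  # helper sketched earlier
        escalation_hits.append(escalated == case["expected_escalation"])

    return {
        "tool_f1": mean(tool_f1s),
        "escalation_accuracy": mean(escalation_hits),
        "n": len(cases),
    }

# Run this before and after every prompt or routing change, and keep the numbers.
```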
The Pattern
The teams building production agents that work aren't the ones with the best framework choice. They're the ones who:
- Picked a reasonable framework
- Invested heavily in implementation quality
- Built evaluation infrastructure to measure what matters
- Iterated on prompts and debugging, not framework selection
Framework debates are a distraction. Implementation quality is the lever.
The Takeaway
If you're choosing an agent framework, here's what our data suggests:
The spread is narrow: 4 percentage points separated our best and worst triage implementations. Framework choice alone won't make or break your project.
Implementation quality dominates: the 45-66% improvements came from bug fixes and prompt engineering, not framework switches.
Measure what matters: Instantiation speed and memory usage are vanity metrics. Task completion quality is the signal.
Pick any reasonable framework. Then invest your energy where it actually moves the needle: prompts, debugging, and evaluation.
Running your own framework comparison? Reply with what you're measuring—we'd like to hear what you find.
References
- Applied AI. (2025). Enterprise Agents Benchmark. Framework comparison across customer triage (250 samples) and customer support (50 conversations), 7 implementations.