Framework Shootout: What Actually Matters
We tested six agent frameworks and a native baseline on the same tasks. The accuracy spread was 4 percentage points. The improvement from fixing a single bug was 45%.
Framework choice mattered less than we expected. Implementation quality mattered more than anyone admits.
The Setup
The agent framework landscape is crowded. LangGraph. CrewAI. AutoGen. DSPy. PydanticAI. Agno. Or skip the framework entirely and go native. Every vendor claims advantages. Every benchmark shows different winners.
So we ran our own comparison: same tasks, same datasets, same evaluation criteria, same model (GPT-4o-mini). Seven implementations were tested on two tasks: a customer triage classification task (250 samples) and the BrownBox E-Commerce dataset, 50 multi-turn customer support conversations requiring tool calling for order lookups, account status checks, refund processing, and escalation decisions.
The question wasn't "which framework is best?" It was "how much does framework choice actually matter?"
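For reference, "Tool F1" in the tables below is an F1 score over the tool calls an agent makes versus the tool calls the conversation actually required. A minimal sketch of the idea (illustrative, not our exact scoring code; a fuller harness would also compare call arguments):

```python
from collections import Counter

def tool_f1(expected_calls: list[str], actual_calls: list[str]) -> float:
    """F1 over the multiset of tool names called in one conversation."""
    expected, actual = Counter(expected_calls), Counter(actual_calls)
    true_positives = sum((expected & actual).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(actual.values())
    recall = true_positives / sum(expected.values())
    return 2 * precision * recall / (precision + recall)

# Illustrative tool names: the agent looked up the order but skipped the refund tool.
print(tool_f1(["lookup_order", "process_refund"], ["lookup_order"]))  # ~0.67
```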
The Results
Customer Support (Multi-Turn, Tool Calling)
| Framework | Tool F1 | Escalation Accuracy | Notes |
|---|---|---|---|
| LangGraph | 0.75 | 92% | State machine approach |
| Native | 0.69 | 75% | Hand-coded ReAct loop |
| CrewAI | 0.65 | 75% | After routing fix |
| AutoGen | 0.63 | 89% | After prompt fix |
| Agno | 0.34 | 78% | Poor tool selection |
Customer Triage (Classification)
| Framework | Accuracy | Cost per Request |
|---|---|---|
| Native | 87.6% | $0.000074 |
| PydanticAI | 86.4% | $0.000079 |
| DSPy | 84.8% | $0.000081 |
| LangGraph | 84.4% | $0.000089 |
| CrewAI | 88.4% | $0.000171 |
Two different tasks. Same pattern: the spread between frameworks is narrow. The difference between best and worst in triage was 4 percentage points (84.4% to 88.4%). The difference in customer support Tool F1 was wider—but the story behind those numbers is more interesting.
The Real Finding: Bug Fixes > Framework Choice
Here's what the summary tables don't show.
CrewAI's journey: Initial implementation scored 0.45 Tool F1. After investigating, we found a routing bug—conversations were being misclassified before reaching the right agent. After the fix: 0.65 Tool F1. A 45% improvement from fixing one bug.
AutoGen's transformation: Initial implementation struggled with tool selection—the model kept calling wrong functions. After prompt engineering to clarify tool descriptions and add examples: 0.63 Tool F1 from 0.38. A 66% improvement from better prompts.
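To give a flavor of what "clarify tool descriptions and add examples" means in practice, here's an illustrative before/after on an OpenAI-style function schema. The tool name, fields, and wording are hypothetical, not the benchmark's actual prompts:

```python
# Illustrative only: tool name, fields, and wording are hypothetical.
vague_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Gets order info.",  # the model has to guess when this applies
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}

clarified_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": (
            "Fetch status, items, and shipping details for ONE existing order. "
            "Use for questions about delivery dates or order contents. "
            "Do NOT use for refunds (process_refund) or account questions (get_account_status). "
            "Example: 'where is my package?' -> call lookup_order with the customer's order ID."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Order identifier, e.g. 'BB-10482'. Ask the customer if unknown.",
                },
            },
            "required": ["order_id"],
        },
    },
}
```

The structural change is small; the behavioral change (which tool the model picks, and when) is not.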
Set Agno aside and the spread between frameworks (0.63 to 0.75 Tool F1) was smaller than the gains those fixes delivered (+0.20 and +0.25 Tool F1). In other words: a well-implemented mediocre framework outperforms a poorly implemented excellent one.
Framework Overhead Is Real
Token consumption tells a different story. We measured equivalent tasks across implementations:
| Implementation | Tokens Used | Overhead (share of total tokens) |
|---|---|---|
| Native | 28,529 | Baseline |
| LangGraph | 54,060 | 47% |
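If you want to reproduce this kind of measurement, token counts come straight back from the API. A sketch assuming the OpenAI Python client (not our exact instrumentation; frameworks typically expose usage through callbacks or a custom client):

```python
from openai import OpenAI

client = OpenAI()
total_tokens = 0

def tracked_completion(**kwargs):
    """Accumulate the token usage reported with each chat completion."""
    global total_tokens
    response = client.chat.completions.create(**kwargs)
    total_tokens += response.usage.total_tokens
    return response

# Route every model call through tracked_completion, run the benchmark once
# per implementation, then compare total_tokens across implementations.
```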
LangGraph consumed nearly twice as many tokens as the native implementation for the same work: roughly 47% of its usage was framework overhead rather than task content. Where does the overhead come from?
- State management boilerplate: Serializing and deserializing conversation state
- Framework prompts: System instructions that the framework injects
- Logging and tracing: Built-in observability (useful, but not free)
At scale, this compounds. If you're processing 1 million requests monthly, 47% token overhead translates directly to cost. Framework abstractions trade development convenience for runtime efficiency.
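A back-of-the-envelope version of that math, where "overhead" means the share of a request's tokens the framework adds on top of the task itself. The per-request token count and GPT-4o-mini price below are assumptions for illustration, not measurements:

```python
# All inputs are illustrative assumptions, not measured values.
requests_per_month = 1_000_000
native_tokens_per_request = 1_200      # assumed native baseline per request
overhead_share = 0.47                  # framework overhead as a share of total tokens
usd_per_million_tokens = 0.15          # assumed GPT-4o-mini input pricing

# If 47% of a framework run's tokens are overhead, total usage is
# native / (1 - 0.47), i.e. roughly 1.9x the native baseline.
framework_tokens_per_request = native_tokens_per_request / (1 - overhead_share)

def monthly_cost(tokens_per_request: float) -> float:
    return tokens_per_request * requests_per_month / 1_000_000 * usd_per_million_tokens

print(f"native:    ${monthly_cost(native_tokens_per_request):,.2f}/month")
print(f"framework: ${monthly_cost(framework_tokens_per_request):,.2f}/month")
```

Under these assumptions the overhead roughly doubles model spend; the absolute dollars scale with your real per-request token counts.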
This doesn't mean frameworks are wrong—structured state management and built-in observability have real value. But the trade-off should be conscious, not accidental.
The Agno Caveat
Agno deserves special mention because it illustrates a common trap.
Agno claims 529x faster instantiation than LangGraph. Our measurements showed 242x faster instantiation and 35x lower memory usage. Impressive numbers—exactly the kind of benchmarks that get attention on Hacker News.
But Agno's Tool F1 of 0.34 was the worst in our evaluation. It couldn't reliably select the right tools for the task.
Speed metrics and task performance are orthogonal. Instantiation time doesn't matter if the agent can't complete the task. Memory efficiency is irrelevant if accuracy is unacceptable.
When evaluating frameworks, measure what matters for your use case. For most applications, that's task completion quality, not startup latency.
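Both dimensions are cheap to measure yourself. A minimal sketch, with hypothetical `build_agent`/`run_agent` placeholders standing in for whichever framework you're evaluating:

```python
import time
import tracemalloc

def benchmark(build_agent, run_agent, eval_cases):
    """Report startup cost and task quality for one implementation.

    build_agent() constructs the agent; run_agent(agent, x) returns its answer.
    Both are placeholders for your framework of choice; eval_cases is a list
    of (input, expected_output) pairs.
    """
    # Startup cost: wall time and peak memory to construct the agent.
    tracemalloc.start()
    start = time.perf_counter()
    agent = build_agent()
    startup_seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Task quality: the number that decides whether the agent is usable at all.
    correct = sum(run_agent(agent, x) == expected for x, expected in eval_cases)

    return {
        "startup_s": startup_seconds,
        "peak_startup_mem_mb": peak_bytes / 1e6,
        "accuracy": correct / len(eval_cases),
    }
```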
Cost Matters, But Not How You'd Expect
The cost spread across implementations was notable:
- Native: $0.000074 per request
- CrewAI: $0.000171 per request
CrewAI cost 2.3x as much as the native implementation for the same task. But CrewAI also achieved the highest accuracy (88.4% vs 87.6%).
Is 0.8 percentage points worth 2.3x cost? It depends entirely on context:
High-stakes routing: If a misroute costs $50 in agent time, and you process 100,000 requests monthly, the cost difference ($17 vs $7 per 100K) is trivial compared to the error cost savings.
Low-stakes notifications: If misroutes cause minor inconvenience, the cheaper option makes sense.
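Working through the high-stakes scenario with our triage numbers (the $50 misroute cost is the illustrative figure from above, not a measurement):

```python
requests_per_month = 100_000
cost_per_misroute = 50.00   # illustrative figure from the scenario above

native = {"cost_per_request": 0.000074, "accuracy": 0.876}
crewai = {"cost_per_request": 0.000171, "accuracy": 0.884}

def monthly_total(option: dict) -> float:
    api_cost = option["cost_per_request"] * requests_per_month
    error_cost = (1 - option["accuracy"]) * requests_per_month * cost_per_misroute
    return api_cost + error_cost

print(f"native: ${monthly_total(native):,.0f}/month")  # ~$620,007
print(f"CrewAI: ${monthly_total(crewai):,.0f}/month")  # ~$580,017
```

The extra ~$10 of monthly API spend buys roughly $40,000 a month in avoided misroutes. In the low-stakes case the error term shrinks toward zero and the cheaper option wins.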
The framework isn't the decision—the business context is.
What We'd Actually Recommend
After running these comparisons, here's our practical guidance:
Pick Based on Your Constraints
Choose LangGraph if: You need explicit state management, want production observability built-in, and can afford the token overhead. It won on our customer support task.
Choose Native if: You're optimizing for cost and latency, have engineering capacity to build ReAct loops, and don't need framework abstractions. Best value in our triage task.
Choose CrewAI if: You want multi-agent coordination patterns, are willing to pay for convenience, and have use cases that benefit from role-based agent design. Highest accuracy in triage.
Avoid framework churn: Switching frameworks mid-project rarely pays off. The implementation knowledge you've built matters more than marginal framework advantages.
Then Invest in Implementation Quality
Whatever you choose, allocate time for:
Prompt engineering: Our AutoGen improvement (66%) came entirely from better prompts. Tool descriptions, system instructions, and few-shot examples matter more than framework features.
Bug hunting: Our CrewAI improvement (45%) came from fixing a routing bug. Systematic debugging beats framework switching.
Evaluation infrastructure: You can't improve what you don't measure. Build evaluation pipelines before optimizing frameworks.
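An evaluation pipeline doesn't need to be elaborate to be useful. A minimal sketch, reusing the `tool_f1` helper from earlier; `run_agent` and the case format are placeholders for your own system:

```python
import json
from statistics import mean

def evaluate(run_agent, cases_path: str) -> dict:
    """Run an agent over a frozen set of labeled cases and report metrics.

    Each line of the cases file is assumed to be a JSON object like:
      {"conversation": [...], "expected_tools": [...], "expected_escalation": true}
    run_agent(conversation) should return (tools_called, escalated).
    """
    with open(cases_path) as f:
        cases = [json.loads(line) for line in f]

    tool_f1s, escalation_hits = [], []
    for case in cases:
        tools_called, escalated = run_agent(case["conversation"])
        tool_f1s.append(tool_f1(case["expected_tools"], tools_called))  # helper sketched earlier
        escalation_hits.append(escalated == case["expected_escalation"])

    return {
        "tool_f1": mean(tool_f1s),
        "escalation_accuracy": mean(escalation_hits),
        "n": len(cases),
    }

# Run this before and after every prompt or routing change, and keep the numbers.
```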
The Pattern
The teams building production agents that work aren't the ones with the best framework choice. They're the ones who:
- Picked a reasonable framework
- Invested heavily in implementation quality
- Built evaluation infrastructure to measure what matters
- Iterated on prompts and debugging, not framework selection
Framework debates are a distraction. Implementation quality is the lever.
The Takeaway
If you're choosing an agent framework, here's what our data suggests:
The spread is narrow: 4 percentage points separated our best and worst triage implementations. Framework choice alone won't make or break your project.
Implementation quality dominates: the 45-66% improvements came from bug fixes and prompt engineering, not framework switches.
Measure what matters: Instantiation speed and memory usage are vanity metrics. Task completion quality is the signal.
Pick any reasonable framework. Then invest your energy where it actually moves the needle: prompts, debugging, and evaluation.
Running your own framework comparison? Reply with what you're measuring—we'd like to hear what you find.
References
- Applied AI. (2025). Enterprise Agents Benchmark. Framework comparison across customer triage (250 samples) and customer support (50 conversations), 7 implementations.