Benchmark Results
- LangGraph won with 0.75 Tool F1 and 92% escalation accuracy
- Native implementation: 0.69 F1 — hand-written ReAct loop
- Bug fixes improved CrewAI by 45%, better prompts improved AutoGen by 66%
- Agno: 242x faster instantiation but 0.34 F1 (worst task performance)
- Framework choice mattered less than implementation quality
# Multi-turn conversation with tool calling
from customer_support import Agent
agent = Agent(tools=["order_lookup", "refund", "escalate"])
response = agent.chat("Where is my order #12345?")