Key Findings
- Rules alone hit 69% accuracy at zero cost. Adding AI boosts to 88%—for under a penny per 100 requests.
- Framework choice matters less than implementation quality. Fixing bugs improved CrewAI by 45%.
- Bigger models aren't always better. GPT-4o-mini beat GPT-4o on legal extraction—at 10x lower cost.
- Agents will try dangerous things. Within 5 seconds, ours attempted rm -rf /. The Warden pattern stopped it.
- Automate the 77%, use AI for edge cases. One scenario saves $80K/year with this hybrid approach.