The Evidence Problem
We analyzed 598 AI case studies published over the past two years. None had rigorous evidence.
Zero percent. No randomized controlled trials. No A/B tests with statistical significance. No control groups.
This isn't a sampling error. We read through ZenML's curated aggregation of enterprise AI case studies—the largest public collection of its kind—and systematically classified them by evidence quality (Applied AI, 2025). The results weren't subtle.
The Numbers
| Evidence Quality | Percentage |
|---|---|
| Rigorous (RCT, A/B with controls) | 0% |
| Strong experimental | 9.5% |
| Quasi-experimental | 4.7% |
| Before/after only | 16.7% |
| Anecdotal only | 65.7% |
Two-thirds of enterprise AI case studies provide no measurable evidence at all. The typical "case study" reads like this: "We implemented an AI solution. It transformed our operations. Our team is excited about the results."
What does "transformed" mean? By how much? Compared to what?
The Showcase Problem
The deeper issue isn't just methodology—it's incentives. When we classified the tone and framing of these articles:
- 90.5% were explicit marketing showcases
- 84.6% had an overwhelmingly positive tone
- Only 1.5% were critical
- Only 31.3% mentioned any failures at all
This corpus is survivorship bias at scale. Failed projects don't get blog posts. The 88 articles with a critical or mixed tone are disproportionately valuable, and disproportionately rare.
One article from Outropy documented "commercial failure despite technical superiority." Another from eSpark admitted their "conversational interface failed with users." Microsoft's own team described agentic solutions as "brittle, hard to debug, and unpredictable."
These are the honest voices. They remain a minority of the conversation.
What This Means
The enterprise AI industry is making billion-dollar decisions based on anecdotes. When a VP of Engineering asks "Should we implement RAG for our support team?" they'll find dozens of case studies claiming success. What they won't find: evidence that it will work for their use case, their data, their users.
The math is simple but uncomfortable:
- Successful implementations get written up
- Failures stay quiet
- Case studies have no control groups
- Therefore: we have no idea what the actual success rate is
An implementation that improved a metric by 30% might owe 25 points of that to the attention effect alone: new tooling, extra engineering focus, and a team that knows it is being measured. Without a control group, we have no way to separate signal from noise.
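To make that concrete, here is a minimal simulation sketch. The effect sizes (`ATTENTION_LIFT`, `AI_LIFT`) and the support-team scenario are illustrative assumptions, not numbers from the corpus; the point is that a before/after comparison bundles both effects together, while a concurrent holdout isolates the AI's own contribution.

```python
import random

random.seed(0)

# Illustrative assumptions, not measurements from the corpus:
BASELINE = 100.0        # tickets resolved per week before any intervention
ATTENTION_LIFT = 0.25   # lift from extra focus, tooling, and being measured
AI_LIFT = 0.05          # lift attributable to the AI feature itself
NOISE = 0.03            # week-to-week variation

def weekly_output(ai_enabled: bool, under_study: bool) -> float:
    """Simulated output for one team-week under the assumed effects."""
    lift = 1.0
    if under_study:
        lift *= 1.0 + ATTENTION_LIFT   # every studied team gets the attention effect
    if ai_enabled:
        lift *= 1.0 + AI_LIFT          # only the AI arm gets the AI effect
    return BASELINE * lift * random.gauss(1.0, NOISE)

def mean(xs):
    return sum(xs) / len(xs)

weeks = 12
before  = [weekly_output(ai_enabled=False, under_study=False) for _ in range(weeks)]
ai_arm  = [weekly_output(ai_enabled=True,  under_study=True)  for _ in range(weeks)]
holdout = [weekly_output(ai_enabled=False, under_study=True)  for _ in range(weeks)]

# A before/after comparison credits the AI with attention + AI combined (~30%).
print(f"before/after lift: {mean(ai_arm) / mean(before) - 1:+.1%}")
# A concurrent holdout isolates the AI's contribution (~5%).
print(f"A/B lift vs holdout: {mean(ai_arm) / mean(holdout) - 1:+.1%}")
```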
The Tech Echo Chamber
The bias extends to who's even represented. Our analysis found:
| Industry | Share of Case Studies |
|---|---|
| Tech | 53.7% |
| Finance | 10.0% |
| E-commerce | 8.0% |
| Healthcare | 7.0% |
| Other | 21.3% |
Over half of enterprise AI case studies come from tech companies writing for tech audiences. Healthcare, manufacturing, government—industries with different constraints, different data, different risk profiles—are barely present in the conversation.
When a tech company shares their chatbot success story, it tells healthcare almost nothing about whether AI works for patient intake. Different domains, different stakes, different regulatory environments.
The Evidence We Actually Have
Among the rare articles with quantitative claims, these were the metrics that showed up most often:
- Accuracy (22 mentions)
- Latency (9 mentions)
- Cost (7 mentions)
- Throughput (4 mentions)
LinkedIn reported 4x throughput improvement and -66% P90 latency. Prosus claimed -98% token costs and -9 percentage points in hallucination rate. Apollo Tyres noted -88% analysis time.
These numbers exist. They're specific. But without control conditions—without knowing what the baseline alternative would have achieved—they're still hard to interpret. A 66% latency reduction could mean going from "unacceptably slow" to "acceptable," or from "fast" to "faster." Context matters.
The Uncomfortable Question
Here's what we're not saying: that enterprise AI doesn't work. It might work. It might work very well. The problem is we don't have the evidence to know.
The industry has shipped first and asked questions later. That's not inherently wrong—sometimes you learn by doing. But the absence of rigorous evaluation means we're accumulating case studies without accumulating knowledge.
Every team that ships an AI feature is essentially running an uncontrolled experiment. Some will succeed for reasons they don't understand. Some will fail for reasons they can't diagnose. And the next team to face the same problem will have no better information than the one before.
What Would Good Look Like
Rigorous evaluation in enterprise AI isn't impossible. It's just not prioritized. Here's what it would require:
For implementers:
- A/B tests with holdout groups before declaring victory
- Baseline measurements of human-only performance
- Statistical significance testing on claimed improvements (see the sketch after this list)
- Documentation of what didn't work
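As a sketch of what the first and third items could look like in practice, here is a two-proportion z-test on a hypothetical A/B split with a holdout group. The ticket counts are invented for illustration; the test itself is the standard pooled z-test.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two proportions (pooled z-test)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: tickets resolved on first contact, AI arm vs. holdout.
ai_resolved, ai_total = 620, 1000
holdout_resolved, holdout_total = 570, 1000

p = two_proportion_ztest(ai_resolved, ai_total, holdout_resolved, holdout_total)
lift = ai_resolved / ai_total - holdout_resolved / holdout_total
print(f"absolute lift: {lift:+.1%}, p-value: {p:.4f}")
# Declare victory only if the lift is practically meaningful AND p < 0.05.
```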
For the industry:
- More honest failure documentation (like the 88 articles we found)
- Standardized metrics that allow comparison across implementations (a minimal schema is sketched after this list)
- Repositories of negative results, not just successes
- Peer review for extraordinary claims
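One way to make "standardized metrics" concrete: a shared reporting schema that every case study fills in. The field names below are a suggestion, not an existing standard, and the example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CaseStudyReport:
    """A suggested minimum set of reporting fields, not an existing standard."""
    use_case: str                    # e.g. "support ticket triage"
    industry: str                    # e.g. "healthcare", not just "tech"
    evidence_level: str              # "rct", "ab_test", "before_after", or "anecdotal"
    baseline_metric: float           # human-only or pre-AI performance
    treatment_metric: float          # performance with the AI system
    sample_size: int                 # units measured: tickets, users, weeks
    p_value: Optional[float] = None  # None if no significance test was run
    failures: List[str] = field(default_factory=list)  # what didn't work

# Hypothetical example of a filled-in report.
report = CaseStudyReport(
    use_case="support ticket triage",
    industry="e-commerce",
    evidence_level="ab_test",
    baseline_metric=0.57,
    treatment_metric=0.62,
    sample_size=2000,
    p_value=0.023,
    failures=["first conversational UI was rejected by agents in the pilot"],
)
print(report)
```

Even a schema this small forces the two disclosures most case studies skip: the evidence level and the failures.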
None of this is exotic methodology. It's what the pharmaceutical industry has done for decades. It's what mature engineering disciplines expect. AI has just skipped the step where we prove things work.
The Path Forward
The ZenML corpus isn't definitive—it has its own selection bias, focused on teams sophisticated enough to write technical blog posts. The true evidence gap in enterprise AI is likely worse than what we found.
But the 88 critical articles offer something valuable: honesty about what's hard. Microsoft calling agents "unpredictable." Dropbox noting that "models don't use full context well; irrelevant context degrades quality." Bismuth documenting a 97% false positive rate in their initial agent implementation.
These admissions are more useful than a hundred success stories. They tell you what to watch for. They tell you what might go wrong.
As we build out evaluation frameworks and implementation guides, we're starting from this baseline: the industry is flying blind. The evidence that AI works in enterprise settings is weak, biased toward survivors, and missing the control conditions that would let us distinguish cause from correlation.
We can do better. But first we have to admit where we are.
Have a question? Reply to this email.
References
- Applied AI. (2025). ZenML Case Study Meta-Analysis. Analysis of 598 articles from ZenML's curated blog aggregation, 2023-2024.
- Microsoft. (2024). Cited in ZenML aggregation: "Agentic solutions described as brittle, hard to debug, and unpredictable."
- Outropy. (2024). Cited in ZenML aggregation: "Commercial failure despite technical superiority."
- Dropbox. (2024). Cited in ZenML aggregation: "Models don't use full context well; irrelevant context degrades quality."
- Bismuth. (2024). Cited in ZenML aggregation: "Existing agents flooded developers with false positives (97% rate for basic loops)."