What Counts as Done?
A high benchmark score feels like a green light. It isn't one. A leaderboard rank is a capability screen, not a readiness certificate: it estimates what a model can do under conditions that systematically flatter it — full information, single-shot attempts, outcome-only grading, a frozen test set, and a harness an agent can sometimes game. Deployment readiness turns on things the score never measures.
This is the measurement discipline behind our economics view. The companion piece argues that an agent pays off only when you can verify its work cheaply enough to trust it — priced in four parameters: how often it's wrong (p), how often you'd catch that (q), what checking costs (v), and what a missed error costs (L). This page is about how you actually estimate those numbers for a real workflow — and why "done," for an agent, is not a terminal state. It is an outcome that stayed inside its authority, plus evidence it was produced correctly, plus a rule for accepting it.
A high score is not a green light
The gap between "scores well" and "ready to deploy" is not a rounding error. Capability and readiness come apart in four documented ways — each the measurement face of a real risk.
Four ways capability and readiness diverge
- Outcome is not value. On a standard tool-use benchmark, a large share of reported successes violated the procedure they were supposed to follow. Grade only the final answer and you understate the true defect rate.
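- Average is not reliability. The production decision variable is the chance of k consecutive successes, which decays with repeated use, not the headline average accuracy. If runs are independent, an agent that is right 90% of the time makes about one error a day at ten runs a day, and gets through a full ten-run day without an error only about 35% of the time (0.9^10 ≈ 0.35).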
- Confidence is not competence. Agents guess when they should ask and act when they should abstain. A high score on fully-specified tasks overstates readiness on the under-specified ones that fill real work.
- Action is not consequence. An agent can reach the goal and still take an unauthorized or irreversible action on the way. Such actions, not the final answer, are what dominate the loss.
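The "average is not reliability" point can be made concrete with a small calculation. The sketch below assumes runs are independent and share one success rate, which real workloads may not satisfy; the function names are illustrative and not part of any published harness.

```python
def streak_probability(success_rate: float, k: int) -> float:
    """Chance of k consecutive successes, assuming independent runs."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return success_rate ** k


def expected_failures(success_rate: float, runs: int) -> float:
    """Expected number of failed runs out of `runs` attempts."""
    return (1.0 - success_rate) * runs


if __name__ == "__main__":
    # Headline accuracy stays at 90% while the clean-streak chance decays.
    for k in (1, 10, 50):
        print(k, round(streak_probability(0.9, k), 3))
    print("expected failures in 10 runs:", round(expected_failures(0.9, 10), 2))
```

Under these assumptions, a 90% agent completes ten runs in a row without an error only about 35% of the time, and fifty in a row well under 1% of the time. That streak probability, rather than the average, is the number a production decision should look at.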
The grading method is the decision
Here is the link most evaluations miss: how you grade — a deterministic check, a trajectory audit, a rubric, or an LLM-as-judge — is your choice of verification cost v and catch rate q. A cheap automated check and an expensive human audit produce different numbers for the same agent, because they are different verifiers. So a benchmark rank cannot tell you deployment readiness: it has silently fixed the grading method, and with it the two parameters that decide whether the workflow pays off.
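To see why the grading method changes the answer, here is a minimal sketch that prices the same agent under two verifiers. The cost model is an assumption made for illustration, not the companion piece's formula: every output is checked at cost v, an error occurs with probability p, the check catches it with probability q, and a missed error costs L, giving an expected cost per task of v + p(1 - q)L. The rework cost of caught errors is ignored, and all numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verifier:
    name: str
    v: float  # cost of checking one output
    q: float  # probability that an error is caught


def expected_cost_per_task(p: float, verifier: Verifier, L: float) -> float:
    """Assumed model: pay v on every task, pay L when an error slips through."""
    return verifier.v + p * (1.0 - verifier.q) * L


if __name__ == "__main__":
    p, L = 0.10, 500.0  # illustrative defect rate and loss per missed error
    cheap = Verifier("automated check", v=0.05, q=0.60)
    audit = Verifier("human audit", v=8.00, q=0.95)
    for verifier in (cheap, audit):
        cost = expected_cost_per_task(p, verifier, L)
        print(f"{verifier.name}: {cost:.2f} per task")
```

With these made-up inputs the expensive audit is the cheaper option overall, because missed errors dominate the cost. Lower L far enough and the ranking flips, which is why a score graded under one fixed verifier cannot settle the question for a different workflow.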
Benchmarks can be gamed — sometimes completely
Leaderboards are not just flattering; they are corruptible. An automated agent drove eight prominent agent benchmarks to near-perfect scores — several to 100%, one to ~98% — without solving a single task, by exploiting the harness rather than the problem. And in February 2026, OpenAI stopped reporting SWE-bench Verified after finding most of an audited failing-test subset was itself flawed. The lesson is not "benchmarks are useless" — it is that a score is a screen, and readiness needs a different instrument.
What we measure instead
A deployment-grade evaluation answers the readiness question the leaderboard can't. It has four disciplines:
- Per-workflow dimensions, reported with error bars. Not one number — the parameters that drive the decision, each with its uncertainty, for this workflow.
- References held out, never blended into tuning. Hold out fault types the agent was never shown, so you measure transfer, not memorization. Set the bar before you tune, and report the result whether or not it clears the bar.
- Release gates stated as verification thresholds. "Ship if the catch rate clears X at cost below Y" — an explicit rule, not a vibe.
- A published evaluation card per workflow. The agent's defect rate, catch rate, verification cost, loss proxy, authority envelope, and the gate — on the record.
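A release gate with error bars might look like the following sketch. It assumes the catch rate is estimated by seeding known errors and counting how many the verifier catches, puts a Wilson score interval around that estimate, and gates on the lower bound rather than the point estimate. Both the seeding procedure and the lower-bound rule are illustrative choices, not a description of the private harness.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def gate_passes(caught: int, seeded_errors: int, check_cost: float,
                min_catch_rate: float, max_check_cost: float) -> bool:
    """Ship only if the catch-rate lower bound clears X at a cost below Y."""
    lower, _ = wilson_interval(caught, seeded_errors)
    return lower >= min_catch_rate and check_cost <= max_check_cost


if __name__ == "__main__":
    low, high = wilson_interval(46, 50)
    print(f"catch rate 46/50, interval [{low:.3f}, {high:.3f}]")
    print("gate at 0.90:", gate_passes(46, 50, check_cost=0.5,
                                       min_catch_rate=0.90, max_check_cost=1.0))
```

In this example the point estimate of 0.92 clears a 0.90 bar, but the interval's lower bound does not, so the gate holds. With only fifty seeded errors the evidence is too thin to support the claim, which is the practical reason to report uncertainty alongside each parameter.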
What "done" means
For an agent, done is not "the answer looks right." It is: the outcome stayed inside the authority the agent was granted, there is evidence it was produced correctly, and a stated rule accepted it. Evaluation, in this view, is not a scoreboard. It is the instrument that estimates the parameters the economics runs on.
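The three-part definition can be read as a conjunction, sketched below. The record fields, action names, and the set-based authority envelope are hypothetical simplifications chosen for illustration; a real authority check would likely be richer than set membership.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    actions: tuple[str, ...]            # every action the agent took
    evidence_checks: tuple[bool, ...]   # results of checks on how the outcome was produced
    gate_accepted: bool                 # did the stated acceptance rule accept it?


def is_done(run: RunRecord, authority: frozenset[str]) -> bool:
    """Done = inside authority AND evidence of correct production AND rule accepted."""
    inside_authority = all(action in authority for action in run.actions)
    has_evidence = len(run.evidence_checks) > 0 and all(run.evidence_checks)
    return inside_authority and has_evidence and run.gate_accepted


if __name__ == "__main__":
    authority = frozenset({"read_ticket", "draft_reply"})
    right_answer_wrong_path = RunRecord(
        actions=("read_ticket", "issue_refund", "draft_reply"),
        evidence_checks=(True, True),
        gate_accepted=True,
    )
    print(is_done(right_answer_wrong_path, authority))
```

The example run is rejected even though its checks pass and the gate accepted it, because one action fell outside the granted authority. A run with no evidence at all is also rejected: an answer that merely looks right does not count.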
We publish the principles. The harness, rubrics, and prompts that implement them stay private — but every number on every Agent Lab exhibit was produced this way, and the methodology is why you can trust them.
Key Takeaways
- A leaderboard score is a capability screen, not a readiness certificate — it flatters the model with full information, single-shot attempts, and outcome-only grading.
- Capability and readiness diverge four ways: outcome ≠ value, average ≠ reliability, confidence ≠ competence, action ≠ consequence.
- How you grade is your choice of verification cost and catch rate — so benchmark rank cannot measure deployment readiness.
- Benchmarks are gameable, sometimes to 100% without solving anything; treat scores as screens, not proof.
- Deployment-grade evaluation = per-workflow dimensions with error bars, held-out references, gates as explicit thresholds, and a published evaluation card.