What Counts as Done?
A leaderboard score is a capability screen, not a readiness certificate. It measures the model under conditions that flatter it (full information, single-shot evaluation, and outcome-only grading), while deployment readiness turns on parameters the score never estimates. Here is how we actually measure whether an agent is ready to ship.