How We Think

Research-led perspectives on building AI agents you can trust — the point of view behind the Agent Lab exhibits.

Perspective

What Counts as Done?

A leaderboard score is a capability screen, not a readiness certificate. It measures the model under conditions that flatter it (full information, single-shot, outcome-only grading), while deployment readiness turns on parameters the score never estimates. Here is how we actually measure whether an agent is ready to ship.

Jun 11, 2026 9 min

Perspective

When Do AI Agents Actually Pay Off?

The question every AI pilot asks, "can the model do the task?", is the wrong one. What you actually buy is verified useful output at an acceptable risk, and that turns on the cost of checking the work, not on raw capability. We lay out a framework and a worked example in which the same model pays off on one task and loses money on another.

Jun 11, 2026 9 min