Agent Lab

How we think about AI agents — and the working demos we built to test it.

Two short papers on when an agent is worth building and how to tell it's done, plus a handful of prototypes that put the ideas into practice — including a World Cup forecaster you can check, live.

Our perspective

How we think about agents

Two questions sit under every project below: when an agent is actually worth building, and how you know it's done. These are our answers.

When Do AI Agents Actually Pay Off?

The question every AI pilot asks — "can the model do the task?" — is the wrong one. What you actually buy is verified useful output at an acceptable risk, and that turns on the cost of checking the work, not on raw capability. A framework, and a worked example where the same model pays off on one task and loses money on another.

Read the full piece

What Counts as Done?

A leaderboard score is a capability screen, not a readiness certificate. It measures the model under conditions that flatter it — full information, single-shot, outcome-only grading — while deployment readiness turns on parameters the score never estimates. How we actually measure whether an agent is ready to ship.

Read the full piece

Weighing whether an agent is worth building for a real task? Talk to us about scoping it →

Working demos

The agents in the lab

Each one takes a concrete task, builds an agent for it, and shows the work — the pattern, the reasoning, and where it's checked against a result or a simpler baseline.

Live

World Cup Forecast Agent

A statistical model projects every match — result and likely scoreline — and an AI agent sharpens it with what the numbers can't see: injuries, line-ups, the week's news. Every call is graded in public against the result.

Open the live forecast desk
Every call graded in public against the result.

Public CLI · two-host casts

Relay — across Claude Code and Codex

Coding agents like Claude Code and Codex write code well but forget what they were doing. Relay is a small CLI that keeps a project's plan, issues, and pull requests in sync across both — so you can switch sessions or agents without losing the thread. Open source, on PyPI.

See how it works
Merged pull requests on a public repo.

Working prototype

Coverage Auditor

Hand it a dense insurance benefits document and ask a question. The agent answers with the exact clause and figure it relied on, and a separate deterministic layer re-checks every number — so the answer is grounded in the document, not the model's memory.

Review the evidence audit
Every answer pinned to its clause and page.
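
To make the deterministic re-check concrete, here is a minimal sketch of one way such a layer could work: it pulls every number out of the agent's answer and flags any that do not appear in the clause the answer cites. This is an illustration under assumptions, not the Coverage Auditor's actual code. The function names (`extract_numbers`, `unsupported_numbers`) and the sample clause are hypothetical.

```python
import re

# Matches integers and decimals, with optional thousands separators: 80, 1,500, 12.5
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Return the numeric tokens in text, with thousands separators removed."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def unsupported_numbers(answer: str, clause: str) -> set[str]:
    """Return numbers that appear in the answer but not in the cited clause."""
    return extract_numbers(answer) - extract_numbers(clause)


# Hypothetical clause and answers, invented for this example.
clause = "The plan pays 80% of covered charges after a $1,500 annual deductible."
good = "The plan pays 80% after the $1,500 deductible."
bad = "The plan pays 90% after the $1,500 deductible."

print(unsupported_numbers(good, clause))  # empty set: every figure is in the clause
print(unsupported_numbers(bad, clause))   # flags '90', which the clause never states
```

Because the check is plain string matching with no model in the loop, it gives the same verdict on every run, which is the point of keeping it separate from the agent.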

Replay demo

Customer Support Agent

Given a customer's support conversation, the agent works out what to do — pull up the order, check a policy, then resolve or escalate to a person. We put it head-to-head with a simpler keyword baseline on hand-scored conversations, and published every case — including the ones it got wrong.

Inspect the evaluation
Every call checked against a human-scored case.

36 cases · scored vs gold

[Chart: one tile per case, coloured by the agent's action (Resolved or Escalated) and marked ✓/✗ against human-verified gold.]
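
A head-to-head like this comes down to running each policy over the same hand-scored cases and counting agreement with the gold label. The sketch below shows that scoring loop for a keyword baseline only. It is an assumption-laden illustration, not the published evaluation: the three cases, the keyword list, and the names `Case`, `keyword_baseline`, and `accuracy` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    conversation: str
    gold: str  # human-verified action: "Resolved" or "Escalated"


# Hypothetical trigger words; a real baseline would choose its own.
ESCALATION_KEYWORDS = ("refund", "lawyer", "chargeback")


def keyword_baseline(conversation: str) -> str:
    """Escalate when any trigger word appears, otherwise resolve."""
    text = conversation.lower()
    if any(word in text for word in ESCALATION_KEYWORDS):
        return "Escalated"
    return "Resolved"


def accuracy(policy: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the policy's action matches the gold label."""
    hits = sum(policy(case.conversation) == case.gold for case in cases)
    return hits / len(cases)


# Invented cases for illustration only; not drawn from the 36 published ones.
cases = [
    Case("Where is my order? It was due Monday.", "Resolved"),
    Case("I want a refund or I will file a chargeback.", "Escalated"),
    Case("This is the third broken unit, I need a manager.", "Escalated"),
]

# The baseline misses the third case, which contains no trigger word.
print(f"{accuracy(keyword_baseline, cases):.2f}")
```

Any agent exposed as a function from conversation text to an action can be passed to the same `accuracy` call, so both sides are graded against identical cases.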