Scorecard

How good is good?

A forecast is only honest if it's scored. We publish every number ahead of kickoff and grade it once the match is played — model and agent both. The hard part is knowing what a good score even is, so we anchor it explicitly.

We score with the Brier score: the squared distance between the forecast and what happened, summed over win/draw/loss. Lower is better: 0 is a perfect call, and ~0.67 is an even three-way guess that ignores the teams. The floor sits just under that guess: always predict the long-run base rate. Beating it is the minimum bar. The sharp market is the practical ceiling: the consensus of ~12 bookmakers' per-match prices, with the margin stripped out. The model is judged by where it lands between the two; the agent only by whether its calls improve the model's score.
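To make the scale concrete, here is a minimal sketch of the three-outcome Brier score described above. The function name `brier_score` and the example forecasts are illustrative assumptions, not the site's actual code.

```python
from typing import Sequence

# Outcome order used throughout this sketch (an assumption for illustration).
OUTCOMES = ("win", "draw", "loss")


def brier_score(forecast: Sequence[float], outcome: str) -> float:
    """Multi-class Brier score for one match: squared error summed over W/D/L."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    if abs(sum(forecast) - 1.0) > 1e-9:
        raise ValueError("forecast probabilities must sum to 1")
    return sum(
        (p - (1.0 if name == outcome else 0.0)) ** 2
        for p, name in zip(forecast, OUTCOMES)
    )


# A perfect call scores 0.
perfect = brier_score([1.0, 0.0, 0.0], "win")

# An even three-way guess scores 2/3 (about 0.67) whatever happens.
uniform = brier_score([1 / 3, 1 / 3, 1 / 3], "draw")

# A confident miss is punished hard: 0.49 + 0.04 + 0.81 = 1.34.
confident_miss = brier_score([0.7, 0.2, 0.1], "loss")
```

A published figure such as 0.509 would then be the mean of these per-match scores over all scored matches.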

The model, against its benchmarks

78 matches scored

Floor · naive prior

0.654 Brier

Top pick right 44.9% of the time

Always forecast the long-run base rates (44/22/33 W/D/L across historical internationals), which makes the top pick a home win, the most common result, every time. The minimum bar.
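As a rough check on where the floor should land, the sketch below computes the expected Brier score of a constant forecast. It assumes the 44/22/33 split is used as the forecast itself and renormalises it to sum to 1 (the quoted figures are rounded); both choices are assumptions, and `expected_brier` is a hypothetical helper.

```python
def expected_brier(forecast, frequencies):
    """Expected Brier of a constant forecast when outcomes occur
    with the given long-run frequencies."""
    total = 0.0
    for k, freq in enumerate(frequencies):
        # Brier score the forecast would earn if outcome k happened.
        score = sum(
            (p - (1.0 if j == k else 0.0)) ** 2
            for j, p in enumerate(forecast)
        )
        total += freq * score
    return total


raw = [44, 22, 33]                       # W/D/L split, rounded in the source
base_rate = [x / sum(raw) for x in raw]  # renormalised so it sums to 1

# Forecasting the base rate when outcomes follow it: 1 - sum(p^2) = 52/81.
floor_estimate = expected_brier(base_rate, base_rate)

# An even three-way guess earns 2/3 regardless of the frequencies.
uniform_estimate = expected_brier([1 / 3] * 3, base_rate)
```

The estimate comes out near 0.642, a little under the even-guess 0.667. The published 0.654 is measured on 78 actual results rather than on the long-run split, so the two are not expected to match exactly.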

Our model

0.509 Brier · 0.145 under floor

Top pick right 61.5% of the time (+16.6 pp)

0.041 Brier behind the market on the 43 matches with odds.

Sharp market · benchmark

0.421 Brier

Top pick right 72.1% of the time

De-vigged bookmaker consensus over 43 matches with published odds. The practical ceiling.
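For readers unfamiliar with de-vigging, here is a minimal sketch of one common approach: convert decimal odds to implied probabilities, then rescale so they sum to 1. The page does not say which method is used or how bookmakers are pooled, so the proportional rescaling, the simple average, and the odds below are all assumptions for illustration.

```python
def devig(decimal_odds):
    """Strip the bookmaker margin by proportional rescaling.

    Implied probabilities (1 / odds) sum to more than 1; the excess
    is the margin. Dividing by the sum removes it.
    """
    implied = [1.0 / o for o in decimal_odds]
    overround = sum(implied)
    return [p / overround for p in implied]


def consensus(books):
    """Average the de-vigged W/D/L probabilities across bookmakers."""
    fair = [devig(b) for b in books]
    return [sum(col) / len(fair) for col in zip(*fair)]


# Hypothetical decimal odds (win, draw, loss) from two bookmakers.
books = [
    [2.10, 3.30, 3.60],
    [2.05, 3.40, 3.70],
]
market = consensus(books)  # three probabilities summing to 1
```

Other de-vig methods exist and can give different fair probabilities for long shots, so this should be read as one plausible recipe, not the benchmark's exact construction.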

The full record

Calibration

When the model says 60%, does it happen 60% of the time? Reliability bins from the walk-forward backtest.
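A reliability table of this kind can be sketched as follows: group forecast probabilities into equal-width bins and compare the average forecast in each bin with how often the event happened. The bin count, the function name `reliability_bins`, and the sample data are assumptions; the page does not describe its binning.

```python
def reliability_bins(probs, hits, n_bins=10):
    """Return (mean forecast, observed frequency, count) per non-empty bin.

    probs: forecast probabilities for an event (e.g. "home win").
    hits:  1 if the event happened, 0 otherwise.
    """
    sums = [0.0] * n_bins
    wins = [0] * n_bins
    counts = [0] * n_bins
    for p, hit in zip(probs, hits):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[i] += p
        wins[i] += hit
        counts[i] += 1
    return [
        (sums[i] / counts[i], wins[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i]
    ]


# Hypothetical forecasts that all fall in the 60-70% bin.
probs = [0.62, 0.65, 0.61, 0.68, 0.64]
hits = [1, 1, 0, 1, 0]
bins = reliability_bins(probs, hits)
mean_forecast, observed, count = bins[0]
```

Here the bin's mean forecast is 0.64 and the event happened 3 times in 5 (0.60). A well-calibrated model keeps those two numbers close in every bin, given enough matches per bin.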

Track record

Champion odds as they moved across runs, and the live Brier (0.509 vs floor 0.654) over scored matches.

Model vs market

The model beside Polymarket and Kalshi, each de-vigged separately — never pooled into a consensus.

Agent scorecard

Does the agent's reading of live context improve on the model? Every move, scored once the match is played.

The agent, scored


72/72

Matches analysed

Picks overturned

Calls graded

62.5%

Agent top-pick

Over 72 graded calls, the agent's top pick was right 62.5% of the time. Each call also carries a signed score delta versus the model — whether that specific move helped or hurt — on its match page. Its biggest nudge so far: Portugal v DR Congo, +13.7 pp.
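One way to read the signed score delta is as the model's Brier score minus the agent's on the same match, so a positive value means the agent's adjustment helped. That sign convention, the helper names, and the forecasts below are assumptions for illustration; the page does not spell out the formula.

```python
def brier(forecast, outcome_index):
    """Three-outcome Brier score; outcome_index is 0=win, 1=draw, 2=loss."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


def score_delta(model, agent, outcome_index):
    """Positive when the agent scored better (lower Brier) than the model."""
    return brier(model, outcome_index) - brier(agent, outcome_index)


# Hypothetical match: the agent nudges the win probability up by 10 pp.
model = [0.50, 0.27, 0.23]
agent = [0.60, 0.22, 0.18]

helped = score_delta(model, agent, 0)  # the win happened: nudge paid off
hurt = score_delta(model, agent, 2)    # the loss happened: nudge cost points
```

With these numbers the same nudge is worth about +0.135 if the win lands and about -0.165 if it does not, which is why each call is graded only after the match is played.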