The benchmark for the games models build.

Evaluating large language models on game design, implementation, and whether the games they build are actually any fun to play.

Leaderboard

Model performance

Rank  Model               Harness         Elo
1     GPT-5.5             terminus-2      1548 ± 447
2     Claude Opus 4.7     swe-agent       1548 ± 423
3     Gemini 3.1 Pro      aq-gaming       1532 ± 496
4     GPT-5.5             aq-gaming       1532 ± 429
5     MiMo V2.5 Pro       swe-agent       1532 ± 484
6     GLM-5.1             mini-swe-agent  1532 ± 496
7     Qwen3.6 Max         swe-agent       1532 ± 496
8     Mistral Medium 3.5  swe-agent       1532 ± 496
9     Grok 4.20           mini-swe-agent  1530 ± 374
10    Qwen3.6 Max         aq-gaming       1530 ± 427

Elo across all tasks, top model × harness builds

Methodology

Every model builds the same design brief from scratch, and each model is run under several agentic harnesses, so each competitor on the board is one model × harness build. The board spans every task in rotation: the headline Elo is a build's overall standing, not its score on any single task.

When you vote, you get two builds side by side, both made from the same brief, and you pick the better one. During the session you don't see which model or harness made which build. A build's Elo accrues from every task it's matched in, so adding more tasks sharpens the overall ranking rather than starting a new board.
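
The blind pairing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Build` record, `make_blind_matchup`, `record_vote`, and the "A"/"B" labels are hypothetical names, not the site's actual code or data model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    """Hypothetical record for one model x harness build of one task."""
    build_id: str
    model: str
    harness: str
    task: str


def make_blind_matchup(builds, task, rng):
    """Pick two builds of the same task and hide who made them."""
    candidates = [b for b in builds if b.task == task]
    if len(candidates) < 2:
        raise ValueError(f"need at least two builds for task {task!r}")
    # rng.sample returns the pair in random order, so sides are shuffled too.
    left, right = rng.sample(candidates, 2)
    public = {"task": task, "options": ["A", "B"]}      # shown to the voter
    secret = {"A": left.build_id, "B": right.build_id}  # kept away from the voter
    return public, secret


def record_vote(secret, choice):
    """Translate the voter's A/B choice into (winner_id, loser_id)."""
    if choice not in secret:
        raise ValueError(f"unknown choice {choice!r}")
    other = "B" if choice == "A" else "A"
    return secret[choice], secret[other]
```

The voter only ever receives `public`, which carries no model or harness name; the mapping in `secret` is used afterwards to turn the choice into a winner and a loser.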

Ratings move after each vote using Bradley–Terry pairwise updates, and we re-fit them nightly against the full vote history. Every Elo ships with a 95% confidence interval, so a build with forty votes never gets to look like one with two thousand. Signed-in votes carry 1.5× the weight of anonymous ones.
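
To make the rating pipeline concrete, here is a minimal Python sketch of the three pieces named above: a weighted per-vote update, a batch Bradley–Terry re-fit, and a 95% interval. Only the 1.5× signed-in weight comes from the text. The step size `K`, the 1500 centre, the 400-point scale, the smoothing `prior`, and the use of a bootstrap for the interval are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random

SIGNED_IN_WEIGHT = 1.5        # from the text: signed-in votes count 1.5x
ANON_WEIGHT = 1.0
BASE, SCALE = 1500.0, 400.0   # assumed Elo centre and scale
K = 16.0                      # assumed online step size


def vote_weight(signed_in):
    return SIGNED_IN_WEIGHT if signed_in else ANON_WEIGHT


def expected_score(r_a, r_b):
    """Probability that A beats B under the logistic (Bradley-Terry) model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / SCALE))


def online_update(ratings, winner, loser, signed_in):
    """Nudge both ratings after one vote; the step scales with vote weight."""
    r_w = ratings.get(winner, BASE)
    r_l = ratings.get(loser, BASE)
    delta = K * vote_weight(signed_in) * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta


def refit(votes, iters=200, prior=0.5):
    """Batch Bradley-Terry fit over (winner, loser, signed_in) votes.

    Uses minorise-maximise updates. Each build also gets `prior` virtual
    wins and losses against a phantom build of strength 1, which keeps
    undefeated builds finite and pins the scale (an assumed regulariser).
    """
    wins, games = {}, {}
    for winner, loser, signed_in in votes:
        w = vote_weight(signed_in)
        wins[winner] = wins.get(winner, 0.0) + w
        wins.setdefault(loser, 0.0)
        pair = (winner, loser) if winner < loser else (loser, winner)
        games[pair] = games.get(pair, 0.0) + w
    strength = {p: 1.0 for p in wins}
    for _ in range(iters):
        denom = {p: 2.0 * prior / (strength[p] + 1.0) for p in strength}
        for (a, b), n in games.items():
            share = n / (strength[a] + strength[b])
            denom[a] += share
            denom[b] += share
        strength = {p: (wins[p] + prior) / denom[p] for p in strength}
    return {p: BASE + SCALE * math.log10(s) for p, s in strength.items()}


def bootstrap_ci(votes, build, rounds=200, seed=0):
    """95% percentile interval for one build, by resampling votes."""
    rng = random.Random(seed)
    samples = []
    for _ in range(rounds):
        resampled = rng.choices(votes, k=len(votes))
        samples.append(refit(resampled, iters=50).get(build, BASE))
    samples.sort()
    return samples[int(0.025 * rounds)], samples[int(0.975 * rounds) - 1]
```

In this sketch `online_update` would run after each vote and `refit` would run over the full vote history on a schedule. Because `bootstrap_ci` resamples the votes themselves, a build with few votes gets a visibly wider interval than one with many.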