GPT-5.5 beats Polymarket traders on Super Bowl, fails on UK elections in Max Planck forecasting study
Researchers at the Max Planck Institute released FutureSim, an environment where AI agents predict real-world events using only archived web data. GPT-5.5 in Codex outpaced human traders on a $704M Super Bowl market with a Brier skill score of 0.90 but failed on UK elections and the Grammys.

Researchers at the Max Planck Institute released FutureSim, an environment that replays temporal slices of archived web data and tasks AI agents with predicting real-world events before they happen. The setup pits models directly against Polymarket, where traders have actual money on the line, but agents have no live internet access, only historical news up to a cutoff date.
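The cutoff rule is simple to picture in code. The sketch below is not FutureSim's implementation; the `Article` type, the `visible_to_agent` helper, and the sample headlines are hypothetical, and it assumes the cutoff is inclusive ("up to a cutoff date").

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical record for one archived news item."""
    published: date
    title: str


def visible_to_agent(archive: list[Article], cutoff: date) -> list[Article]:
    """Return only the articles published on or before the cutoff.

    Assumption: the cutoff is inclusive. Anything later is hidden,
    which is what keeps the agent from peeking at the outcome.
    """
    return [a for a in archive if a.published <= cutoff]


# Toy archive with made-up headlines, for illustration only.
archive = [
    Article(date(2026, 1, 10), "Playoff picture takes shape"),
    Article(date(2026, 2, 1), "Injury report ahead of the final"),
    Article(date(2026, 2, 9), "Final result and recap"),
]

snapshot = visible_to_agent(archive, cutoff=date(2026, 2, 1))
titles = [a.title for a in snapshot]
```

With a cutoff of February 1, the recap published afterward never reaches the agent, so any forecast has to come from the first two items alone.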
GPT-5.5 running in Codex beat the aggregate of human traders on Polymarket's Super Bowl LX market ($704 million in trading volume), finishing with a Brier skill score of 0.90, where 1.0 is a perfect forecast and 0 means no improvement over the reference forecast. The same agent outpaced the crowd on Portugal's presidential runoff. But the streak broke on the UK elections and Grammys markets, where its predictions lagged well behind traders. The environment's design isolates forecasting skill from real-time data scraping: agents see only what was publicly available before their cutoff, then make their calls.
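For readers unfamiliar with the metric, a Brier skill score compares a forecaster's mean squared error against that of a reference forecast. The sketch below uses the textbook definition; the study's exact reference forecast and scoring details are not given here, so the function names, the 50% baseline, and the toy numbers are illustrative assumptions rather than FutureSim's code.

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is perfect, and always answering 50% scores 0.25.
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: list[float], ref_probs: list[float], outcomes: list[int]
) -> float:
    """Textbook skill score: 1 - BS_forecast / BS_reference.

    1.0 is perfect, 0.0 matches the reference, negative is worse than it.
    """
    bs_ref = brier_score(ref_probs, outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill score undefined")
    return 1.0 - brier_score(probs, outcomes) / bs_ref


# Toy numbers, not results from the study.
outcomes = [1, 0, 1]            # what actually happened
agent = [0.9, 0.1, 0.8]         # hypothetical agent probabilities
baseline = [0.5, 0.5, 0.5]      # assumed reference: an uninformed 50% guess

bs_agent = brier_score(agent, outcomes)                # about 0.02
bss = brier_skill_score(agent, baseline, outcomes)     # about 0.92
```

The distinction matters when reading the headline number: a raw Brier score of 0.90 would be very poor, while a skill score of 0.90 means the forecast removed about 90% of the reference's squared error.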
The results suggest models can sometimes extract signal from historical context that crowds miss, but consistency across question types remains elusive. FutureSim benchmarks span sports, politics, and entertainment; performance swings wildly depending on domain and the density of relevant training data. A model that nails a high-liquidity sports market can still flounder on a lower-profile election with sparse English-language coverage.
The next question is whether 2027 brings reliable cross-domain forecasters or just deeper specialization. If agents can't generalize beyond the categories they were tuned on, the practical edge stays narrow — useful for a handful of liquid markets, less so for the long tail of geopolitical and cultural bets where crowd wisdom still dominates.