Summary
- Prophet Arena evaluates AI models on their predictions of real-world events that have not yet resolved, with GPT-5 currently in the lead.
- AI models show distinct prediction styles and can diverge from market expectations, sometimes generating substantial simulated profits.
- Early results suggest AI forecasting may rival prediction markets, potentially transforming how institutions make decisions.
An artificial intelligence benchmark launched in August shows that AI models can predict real-world events with accuracy comparable to, or better than, that of prediction markets, according to researchers at the University of Chicago’s SIGMA Lab.
Prophet Arena scores AI systems on their predictions for live, unresolved events sourced from platforms such as Kalshi and Polymarket, ranging from election outcomes to sports results and economic indicators. Unlike traditional benchmarks built on historical data, whose answers a model may already have seen during training, Prophet Arena tests predictions about outcomes that are not yet known.

“By focusing on unresolved real-world events, Prophet Arena maintains a fair competition. There’s no advantage from pre-training, no hidden fine-tuning tricks, and no leakage of test samples,” the Prophet Arena team highlighted in the benchmark’s official blog post.
The benchmark seeks to answer a critical question regarding AI: “Can AI systems accurately predict future events by linking the dots from current real-world information?”
Early results suggest the answer is yes. GPT-5 currently tops the leaderboard with a Brier score of 82.21% (the leaderboard presents a higher-is-better version of the score; a raw Brier score rewards lower values). Meanwhile, OpenAI’s o3-mini leads on profit, generating the highest average returns when its predictions are converted into simulated bets (backing an underdog at favorable odds can produce outsized returns).
DeepSeek R1 stands out as a contrarian, often making predictions that diverge sharply from both other models and the market consensus, which makes it a less reliable source of quick profits on Myriad Markets.
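For readers unfamiliar with the metric, here is a minimal sketch of how a Brier score works. The standard score is the mean squared difference between a forecast probability and the actual outcome (1 if the event happened, 0 if not), so lower is better and 0.0 is perfect. Because the article describes a top score of 82.21%, the leaderboard evidently reports a rescaled, higher-is-better variant; the `1 - Brier` rescaling in `skill_score` below is an assumption for illustration, not Prophet Arena's documented formula.

```python
def brier_score(forecasts, outcomes):
    """Standard Brier score: mean squared error between forecast
    probabilities (0..1) and binary outcomes (0 or 1). Lower is better."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def skill_score(forecasts, outcomes):
    """Hypothetical higher-is-better rescaling (1 - Brier).
    Assumption only: not necessarily how Prophet Arena computes its score."""
    return 1.0 - brier_score(forecasts, outcomes)


# Three illustrative forecasts: 90% (event happened), 30% (did not), 60% (happened)
forecasts = [0.9, 0.3, 0.6]
outcomes = [1, 0, 1]
print(round(brier_score(forecasts, outcomes), 4))  # 0.0867
print(round(skill_score(forecasts, outcomes), 4))  # 0.9133
```

A confident forecast that turns out right costs little, while a confident forecast that turns out wrong is penalized heavily, which is why the metric rewards calibration rather than lucky calls.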


The platform also reveals distinct “personalities” among AI models given the same information. Asked how likely it was that AI regulation would become federal law before 2026, for instance, the market put the probability at 25%. The models disagreed: Qwen 3 estimated 75%, GPT-4.1 60%, and Llama 4 Maverick a more cautious 35%.
In another case, o3-mini turned a simulated $1 bet into a roughly $9 payout by identifying Toronto FC as undervalued against San Diego FC in a Major League Soccer match: the model put Toronto’s chances of winning at 30%, while the market priced them at just 11%. Toronto won.
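One simple way to quantify these “personalities” is to measure how far each model's probability sits from the market price. The sketch below uses the AI-regulation numbers from the example above; the `divergence_from_market` helper, a mean absolute gap across questions, is a hypothetical illustration, not a metric Prophet Arena is known to publish.

```python
from statistics import mean


def divergence_from_market(model_probs, market_probs):
    """Hypothetical contrarian index: mean absolute gap between a model's
    probabilities and the market's prices across the same questions."""
    if len(model_probs) != len(market_probs):
        raise ValueError("inputs must cover the same questions")
    return mean(abs(m - p) for m, p in zip(model_probs, market_probs))


market = 0.25  # market probability for the AI-regulation question
forecasts = {"Qwen 3": 0.75, "GPT-4.1": 0.60, "Llama 4 Maverick": 0.35}

# List models from most to least divergent on this single question
for model, p in sorted(forecasts.items(), key=lambda kv: -abs(kv[1] - market)):
    gap = divergence_from_market([p], [market])
    print(f"{model}: {gap:.2f} from market")
```

Averaged over many questions, a consistently large gap would flag a contrarian model in the sense the article uses for DeepSeek R1; whether that divergence pays off depends on how often the model, rather than the market, turns out to be right.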
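The Toronto FC arithmetic follows from how binary prediction-market contracts are typically priced: each share costs its market probability and pays $1 if the event occurs. At 11 cents a share, $1 buys about 9.09 shares, hence the roughly $9 payout. The sketch below assumes that payout structure and ignores fees; the `should_bet` rule (bet only when the model's probability exceeds the market price) is a simple illustrative assumption, not Prophet Arena's actual betting simulation.

```python
def payout(stake, market_price, won):
    """Payout of a binary contract that costs market_price per share
    and pays $1 per share if the event occurs (fees ignored)."""
    shares = stake / market_price
    return shares if won else 0.0


def expected_profit(stake, market_price, model_prob):
    """Expected profit of the bet, assuming the model's probability is correct."""
    return model_prob * stake / market_price - stake


def should_bet(model_prob, market_price):
    """Illustrative rule: bet only when the model sees positive expected value."""
    return model_prob > market_price


stake, price, model_p = 1.0, 0.11, 0.30  # Toronto FC example from the article
print(round(payout(stake, price, won=True), 2))          # 9.09
print(round(expected_profit(stake, price, model_p), 2))  # 1.73
print(should_bet(model_p, price))                        # True
```

Note that the model did not need to think Toronto was likely to win; a 30% estimate against an 11% price already implies a positive expected profit, which is why underdog bets drive o3-mini's returns.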
“(Prophet Arena) evaluates models’ forecasting abilities, a sophisticated level of intelligence requiring a diverse skill set, encompassing the understanding of information and news, reasoning amid uncertainty, and making timely predictions about unfolding scenarios,” the researchers wrote.
Prophet Arena also supports human-AI collaboration: users can supply additional news and context to see how forecasts change, and the models explain their predictions in detail.
As prediction markets increasingly adopt AI (Kalshi recently partnered with Elon Musk’s Grok, and Polymarket generates AI-written market summaries), Prophet Arena offers the first systematic comparison of machine forecasting against collective human judgment.
Ultimately, as these models improve, they could base their forecasts purely on evidence, free of the emotions and biases that sway human forecasters. That could let them rival or surpass the wisdom of crowds, fundamentally changing how institutions approach risk assessment, investment decisions, and strategic planning.
The Prophet Arena platform is updated daily as events resolve, offering a running view of whether AI can predict future events by connecting the dots in today’s information.