Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament

Accurately predicting the future would be an important milestone in the capabilities of artificial intelligence. However, research on the ability of large language models to provide probabilistic predictions about future events remains nascent. To test this ability empirically, we enrolled OpenAI's state-of-the-art large language model, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. The tournament ran from July to October 2023, attracted 843 participants, and covered diverse topics including Big Tech, U.S. politics, viral outbreaks, and the Ukraine conflict. Focusing on binary forecasts, we show that GPT-4's probabilistic forecasts are significantly less accurate than the median human-crowd forecasts. We also find that GPT-4's forecasts did not differ significantly from the no-information strategy of assigning a 50% probability to every question. We explore one potential explanation, that GPT-4 might be predisposed to predict probabilities close to the midpoint of the scale, but our data do not support this hypothesis. Overall, GPT-4 significantly underperforms the median human-crowd forecasts in real-world predictive tasks. A potential explanation for this underperformance is that in real-world forecasting tournaments the true answers are genuinely unknown at the time of prediction. This is unlike other benchmark tasks, such as professional exams or time series forecasting, where strong performance may at least partly be due to answers having been memorized from the training data. This makes real-world forecasting tournaments an ideal environment for testing the generalized reasoning and prediction capabilities of artificial intelligence going forward.
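
To make the comparison against the 50% no-information strategy concrete, the sketch below scores binary forecasts with the Brier score (mean squared error between the forecast probability and the 0/1 outcome; lower is better). The abstract does not name the accuracy metric, so the choice of the Brier score is an assumption here, and every number in the example is hypothetical rather than tournament data. Under this metric, a constant 50% forecast always scores exactly 0.25, whatever the outcomes, which is what makes it a natural reference point.

```python
from statistics import mean


def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecaster, and a constant 0.5
    forecast scores 0.25 on any set of binary outcomes.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return mean((p - o) ** 2 for p, o in zip(forecasts, outcomes))


# Hypothetical resolutions of five binary questions (1 = yes, 0 = no).
outcomes = [1, 0, 0, 1, 0]

# Hypothetical forecasts, for illustration only (not data from the tournament).
model_forecasts = [0.6, 0.5, 0.4, 0.3, 0.7]
crowd_median_forecasts = [0.8, 0.2, 0.1, 0.7, 0.3]
no_information_forecasts = [0.5] * len(outcomes)

model_score = brier_score(model_forecasts, outcomes)
crowd_score = brier_score(crowd_median_forecasts, outcomes)
baseline_score = brier_score(no_information_forecasts, outcomes)

print(f"model:    {model_score:.3f}")
print(f"crowd:    {crowd_score:.3f}")
print(f"baseline: {baseline_score:.3f}")
```

A forecaster adds information only if its score falls below the 0.25 baseline. Whether an observed gap is statistically significant is a separate question that this sketch does not address.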
