Forecasts of future events are essential inputs to informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a standardized set of forecasting questions. To address this gap, we introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML systems on an automatically generated and regularly updated set of 1,000 forecasting questions. To avoid any possibility of data leakage, ForecastBench consists solely of questions about future events that have no known answer at the time of submission. We quantify the capabilities of current ML systems by collecting forecasts from expert (human) forecasters, the general public, and LLMs on a random subset of questions from the benchmark ($N=200$). While LLMs have achieved superhuman performance on many benchmarks, they perform less well here: expert forecasters outperform the top-performing LLM (p-value $=0.01$). We display system and human scores on a public leaderboard at www.forecastbench.org.