Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform relative to the gold standard of a human-crowd forecasting-tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of 12 LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. We also observe a set of human-like biases in machine responses, such as an acquiescence effect and a tendency to favour round numbers. In Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of the human crowd via the simple, practically applicable method of forecast aggregation.
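The aggregation idea in the abstract can be illustrated with a minimal sketch. The abstract does not state which aggregation rule or scoring rule the studies use for the LLM crowd, so the median aggregate and the Brier score below are assumptions, and all forecast values are made-up illustrative numbers, not data from the paper. The function names are hypothetical.

```python
from statistics import median

def aggregate_crowd(forecasts):
    """Collapse one question's individual probabilities into a crowd forecast.
    Assumption: the median is used as the aggregate."""
    return median(forecasts)

def mean_brier(probs, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes.
    Assumption: Brier score as the accuracy metric; lower is better."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)

def average_with_human(machine_prob, human_median):
    """Simple average of a machine forecast and the human median forecast."""
    return (machine_prob + human_median) / 2

# Hypothetical data: three binary questions, three forecasters each.
llm_forecasts = [
    [0.70, 0.60, 0.80],  # question 1 (resolved yes)
    [0.20, 0.40, 0.30],  # question 2 (resolved no)
    [0.55, 0.65, 0.90],  # question 3 (resolved yes)
]
outcomes = [1, 0, 1]

# One crowd forecast per question.
crowd = [aggregate_crowd(q) for q in llm_forecasts]

# No-information benchmark: predict 50% on every question.
benchmark = [0.5] * len(outcomes)

crowd_score = mean_brier(crowd, outcomes)
benchmark_score = mean_brier(benchmark, outcomes)

print(f"crowd Brier:     {crowd_score:.4f}")
print(f"benchmark Brier: {benchmark_score:.4f}")
```

Under these assumptions, a crowd forecast that beats the benchmark has a mean Brier score below 0.25, the score of a constant 50% prediction on binary questions. `average_with_human` shows the kind of simple human-machine average that the abstract compares against in Study 2.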