Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Match Human Crowd Accuracy

Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is statistically equivalent to the human crowd. We also observe an acquiescence effect, with mean model predictions being significantly above 50% despite an almost even split of positive and negative resolutions. In Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs and opens up their use for a variety of applications throughout society.
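To make the aggregation idea concrete, the following is a minimal sketch and not the authors' actual pipeline. It assumes the LLM crowd is aggregated by taking the median forecast per question and that accuracy is scored with the Brier score (mean squared error between probability and outcome); the abstract specifies neither choice. All forecast values and outcomes are hypothetical, and the function names are illustrative.

```python
from statistics import median


def aggregate(forecasts):
    """Combine individual probability forecasts for one question.

    Assumption: the median is used as the aggregate; the abstract
    does not state which aggregation rule the study applies.
    """
    return median(forecasts)


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; a constant 50% forecast always scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical data: one row per binary question, one column per model.
model_forecasts = [
    [0.70, 0.65, 0.80],
    [0.40, 0.55, 0.30],
    [0.90, 0.60, 0.75],
]
outcomes = [1, 0, 1]  # hypothetical resolutions (1 = yes, 0 = no)

# Aggregate across models for each question, then score the crowd.
crowd = [aggregate(row) for row in model_forecasts]
crowd_brier = brier_score(crowd, outcomes)

# No-information benchmark: predict 50% on every question.
baseline_brier = brier_score([0.5] * len(outcomes), outcomes)

print(f"crowd Brier:    {crowd_brier:.4f}")
print(f"baseline Brier: {baseline_brier:.4f}")
```

Under these assumptions, comparing `crowd_brier` with `baseline_brier` mirrors the comparison against the no-information benchmark described above. The same `brier_score` function could score a human crowd aggregate or a simple average of human and machine forecasts.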
