Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy

Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform relative to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. In Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.
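To make the aggregation idea concrete, the following is a minimal sketch of how a crowd forecast for binary questions might be formed and compared against a no-information benchmark. It rests on several assumptions that the abstract does not state: the per-question median is used as the aggregate, the Brier score is used as the accuracy measure, and the no-information benchmark is a constant 50% forecast. The function names and all numbers are hypothetical and are not taken from the study.

```python
from statistics import median


def brier_score(forecasts, outcomes):
    """Mean squared difference between probability forecasts and 0/1 outcomes.

    Lower is better. Assumed scoring rule; the abstract does not name one.
    """
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


def aggregate_crowd(forecasts_by_model):
    """Return the per-question median across models.

    forecasts_by_model holds one list of probabilities per model,
    all ordered by the same questions. The median is an assumed choice.
    """
    return [median(question) for question in zip(*forecasts_by_model)]


# Hypothetical forecasts from three models on four binary questions.
model_forecasts = [
    [0.70, 0.40, 0.90, 0.20],
    [0.60, 0.30, 0.80, 0.40],
    [0.90, 0.60, 0.60, 0.10],
]
outcomes = [1, 0, 1, 0]  # hypothetical resolutions (1 = yes, 0 = no)

crowd = aggregate_crowd(model_forecasts)
crowd_score = brier_score(crowd, outcomes)

# Assumed no-information benchmark: predict 50% on every question.
baseline_score = brier_score([0.5] * len(outcomes), outcomes)

print("crowd forecast:", crowd)
print("crowd Brier score:", round(crowd_score, 4))
print("baseline Brier score:", baseline_score)
```

In this toy data the aggregated forecast scores better than the constant 50% forecast, which illustrates the kind of comparison described above without reproducing any of the reported results.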
