Large language models (LLMs) show impressive capabilities, matching and sometimes exceeding human performance in many domains. This study explores the potential of LLMs to augment human judgement in forecasting tasks. We evaluated the effect on forecasting accuracy of two GPT-4-Turbo assistants: one designed to provide high-quality advice ('superforecasting'), the other designed to be overconfident and base-rate-neglecting. Participants (N = 991) could consult their assigned LLM assistant throughout the study, in contrast to a control group that used a less advanced model (DaVinci-003) offering no direct forecasting support. Our preregistered analyses show that LLM augmentation significantly improves forecasting accuracy by 23% across both types of assistants, relative to the control group. This improvement occurs despite the superforecasting assistant's higher predictive accuracy, indicating that the benefit of augmentation is not driven solely by the quality of the model's predictions. Exploratory analyses revealed a pronounced effect in one forecasting item; excluding it, the superforecasting assistant increased accuracy by 43%, compared with 28% for the biased assistant. We further examined whether LLM augmentation disproportionately benefits less skilled forecasters, degrades the wisdom of the crowd by reducing prediction diversity, or varies in effectiveness with question difficulty. Our findings do not consistently support these hypotheses. Our results suggest that access to an LLM assistant, even a biased one, can be a helpful decision aid in cognitively demanding tasks where the correct answer is not known at the time of interaction.