Words of estimative probability (WEPs), such as "maybe" or "probably not," are ubiquitous in natural language for communicating estimative uncertainty, and are far more common than direct statements of numerical probability. Human estimative uncertainty, and its calibration against numerical estimates, has long been an area of study, including by intelligence agencies such as the CIA. This study compares estimative uncertainty in commonly used large language models (LLMs), such as GPT-4 and ERNIE-4, both to that of humans and to one another. Here we show that LLMs like GPT-3.5 and GPT-4 align with human estimates for some, but not all, WEPs presented in English. Divergence is also observed when the LLMs are presented with gendered roles and Chinese contexts. Further experiments show that an advanced LLM like GPT-4 can consistently map between statistical and estimative uncertainty, although a significant performance gap remains. These results contribute to a growing body of research on human-LLM alignment.
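As a concrete illustration of the comparison described above, the sketch below elicits a numeric probability estimate for each of a few WEPs from an LLM and compares it against a human reference value. This is a minimal sketch, assuming the OpenAI chat-completions API; the WEP list, the prompt wording, and the human median values are illustrative placeholders, not the study's actual stimuli or survey data.

```python
# Minimal sketch: elicit numeric probabilities for WEPs from an LLM and
# compare them with human median estimates.
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY in the
# environment; the human reference values below are hypothetical
# placeholders, not data from the study.
import re

from openai import OpenAI

client = OpenAI()

# Hypothetical human median probability estimates per WEP (0-100 scale).
HUMAN_MEDIANS = {
    "almost certainly": 93,
    "probably": 75,
    "maybe": 50,
    "probably not": 25,
}


def elicit_probability(wep: str, model: str = "gpt-4") -> float:
    """Ask the model to map a WEP to a single numeric probability in [0, 100]."""
    prompt = (
        f'If someone says an event will "{wep}" happen, what probability '
        "(a single number between 0 and 100) do they most likely mean? "
        "Answer with the number only."
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # near-deterministic decoding for repeatable estimates
        messages=[{"role": "user", "content": prompt}],
    )
    # Pull the first number out of the reply; a real pipeline would
    # validate the format and average over multiple samples.
    match = re.search(r"\d+(?:\.\d+)?", resp.choices[0].message.content)
    return float(match.group()) if match else float("nan")


for wep, human in HUMAN_MEDIANS.items():
    llm = elicit_probability(wep)
    print(f"{wep!r}: LLM={llm:.0f}, human={human}, gap={llm - human:+.0f}")
```

A fuller version of this comparison would sample each WEP repeatedly, vary the surrounding context (e.g., gendered roles or Chinese-language framing, as in the study), and compare the resulting distributions rather than single point estimates.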