Uncertainty expressions such as ``probably'' or ``highly unlikely'' are pervasive in human language. While prior work has established that there is population-level agreement on how humans interpret these expressions, there has been little inquiry into the ability of language models to interpret such expressions. In this paper, we investigate how language models map linguistic expressions of uncertainty to numerical responses. Our approach assesses whether language models can employ theory of mind in this setting: understanding the uncertainty of another agent about a particular statement, independently of the model's own certainty about that statement. We evaluate both humans and 10 popular language models on a task designed to assess these abilities. Unexpectedly, we find that 8 out of 10 models are able to map uncertainty expressions to probabilistic responses in a human-like manner. However, we observe systematically different behavior depending on whether a statement is actually true or false. This sensitivity indicates that language models are substantially more susceptible than humans to bias based on their prior knowledge. These findings raise important questions and have broad implications for human-AI alignment and AI-AI communication.