As natural language becomes the default interface for human-AI interaction, there is a critical need for LMs to appropriately communicate uncertainties in downstream applications. In this work, we investigate how LMs communicate confidence in their responses via natural language and how downstream users behave in response to LM-articulated uncertainties. We examine publicly deployed models and find that, unless prompted, LMs do not express uncertainty when answering questions, even when they produce incorrect responses. LMs can be explicitly prompted to express confidence, but they tend to be overconfident, resulting in high error rates (on average 47%) among confident responses. We test the risks of LM overconfidence by running human experiments and show that users rely heavily on LM generations, whether or not they are marked by certainty. Lastly, we investigate the preference-annotated datasets used in RLHF alignment and find that humans have a bias against texts that express uncertainty. Our work highlights a new set of safety harms facing human-LM interactions and proposes design recommendations and mitigation strategies moving forward.