As natural language becomes the default interface for human-AI interaction, LMs need to communicate uncertainties appropriately in downstream applications. In this work, we investigate how LMs incorporate confidence into responses via natural language and how downstream users behave in response to LM-articulated uncertainties. We examine publicly deployed models and find that LMs are reluctant to express uncertainty when answering questions, even when they produce incorrect responses. LMs can be explicitly prompted to express confidence levels, but they tend to be overconfident, resulting in high error rates (an average of 47%) among confident responses. We test the risks of LM overconfidence through human experiments and show that users rely heavily on LM generations, whether or not they are marked by certainty. Lastly, we investigate the preference-annotated datasets used in post-training alignment and find that humans are biased against texts that express uncertainty. Our work highlights new safety harms facing human-LM interactions and proposes design recommendations and mitigation strategies moving forward.