Large language models (LLMs) frequently fail to challenge users' harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users' assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models' ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase "wait a minute", significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.