Verbalizing LLMs' assumptions to explain and control sycophancy

LLMs can be socially sycophantic, affirming users when they ask questions like "am I in the wrong?" rather than providing a genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, such as underestimating how often users are seeking information rather than reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues; for example, the top bigram in LLMs' assumptions on social sycophancy datasets is "seeking validation." We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable, fine-grained steering of social sycophancy. We also explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.
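The abstract describes assumption probes only at a high level, so the following is a minimal sketch of the general technique it names (a linear probe on internal representations, followed by steering along the probe direction), not the authors' implementation. Everything in it is an assumption made for illustration: the hidden states are synthetic random vectors rather than real LLM activations, the label stands in for a hypothetical "seeking validation" assumption, and the function names `train_linear_probe`, `probe_score`, and `steer` are invented here.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y) / n)           # gradient of mean log-loss w.r.t. w
        b -= lr * np.mean(p - y)                # gradient of mean log-loss w.r.t. b
    return w, b


def probe_score(X, w, b):
    """Probe probability that each representation encodes the assumption."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def steer(h, w, alpha):
    """Shift representation(s) h by alpha along the unit-norm probe direction."""
    direction = w / np.linalg.norm(w)
    return h + alpha * direction


# Synthetic stand-in for hidden states (ASSUMPTION: not real model activations).
rng = np.random.default_rng(0)
d = 16
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# y = 1 marks the hypothetical "seeking validation" assumption.
y = rng.integers(0, 2, size=400).astype(float)
X = rng.normal(size=(400, d)) + 2.0 * np.outer(2.0 * y - 1.0, true_dir)

w, b = train_linear_probe(X, y)
acc = np.mean((probe_score(X, w, b) > 0.5) == (y == 1.0))

# Steering: a negative alpha pushes representations away from the assumption.
h = X[:5]
before = probe_score(h, w, b)
after = probe_score(steer(h, w, alpha=-4.0), w, b)
print("probe accuracy:", acc)
print("mean score before/after steering:", before.mean(), after.mean())
```

In a real setting the rows of `X` would be activations read from a chosen model layer, and the steered vector would be written back into the forward pass. Which layer to use and how large to make `alpha` are choices the abstract does not specify.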

