LLMs can be socially sycophantic, affirming users who ask questions like "am I in the wrong?" rather than giving a genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, such as underestimating how often users seek information rather than reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues; for example, the most frequent bigram in LLMs' assumptions on social sycophancy datasets is "seeking validation." We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable, fine-grained steering of social sycophancy. We also explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.
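To make the idea of an assumption probe more concrete, the following is a minimal sketch of a linear probe and probe-direction steering. It is not the authors' implementation: the hidden states are synthetic Gaussian vectors rather than real LLM activations, the binary label standing in for a "seeking validation" assumption is hypothetical, and the names `train_linear_probe`, `probe_score`, and `steer` are illustrative only.

```python
import numpy as np


def train_linear_probe(hidden, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe (w, b) on hidden states by gradient descent."""
    n, d = hidden.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))
        grad = probs - labels  # gradient of the mean log-loss w.r.t. the logits
        w -= lr * (hidden.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_score(hidden, w, b):
    """Probe's estimated probability that the assumption is present."""
    return 1.0 / (1.0 + np.exp(-(hidden @ w + b)))


def steer(hidden, w, alpha):
    """Shift hidden states by alpha along the unit-norm probe direction."""
    direction = w / np.linalg.norm(w)
    return hidden + alpha * direction


# Synthetic stand-in for LLM activations (an assumption of this sketch):
# label 1 = the model "assumes" the user is seeking validation, label 0 = not.
rng = np.random.default_rng(0)
d = 16
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=400).astype(float)
hidden = rng.normal(size=(400, d)) + 2.0 * np.outer(2.0 * labels - 1.0, true_dir)

w, b = train_linear_probe(hidden, labels)
accuracy = ((probe_score(hidden, w, b) > 0.5) == (labels == 1.0)).mean()

# Steering against the probe direction lowers the probed assumption score.
steered = steer(hidden, w, alpha=-4.0)
mean_before = probe_score(hidden, w, b).mean()
mean_after = probe_score(steered, w, b).mean()
print(f"probe accuracy: {accuracy:.3f}")
print(f"mean score before: {mean_before:.3f}, after steering: {mean_after:.3f}")
```

In a real setting, `hidden` would be activations read from a chosen layer of the model, and the steered vectors would be written back during the forward pass; which layer and what steering strength to use are choices this sketch does not address.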