LLMs can be socially sycophantic, affirming users when they ask questions like "am I in the wrong?" rather than providing a genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, such as underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues; for example, the top bigram in LLMs' assumptions on social sycophancy datasets is "seeking validation." We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable, fine-grained steering of social sycophancy. We explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.
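The abstract describes assumption probes only as linear probes trained on internal representations, without giving the training or steering procedure. As a rough illustration of the general idea, the sketch below fits a logistic-regression probe on synthetic vectors and then shifts a vector along the probe direction. Everything here is an assumption made for illustration: the synthetic "hidden states", the hypothetical "seeking validation" direction `true_dir`, the helper names (`train_probe`, `steer`, `probe_score`), and the choice of additive steering. It is not the authors' implementation.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(states, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe p(y=1 | h) = sigmoid(w.h + b)
    with full-batch gradient descent."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(states, labels):
            err = sigmoid(dot(w, h) + b) - y
            for i in range(dim):
                grad_w[i] += err * h[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(h, w, b):
    """Probe's estimated probability that the assumption is present."""
    return sigmoid(dot(w, h) + b)


def steer(h, w, alpha):
    """Shift a state by alpha along the unit-normalized probe direction."""
    norm = math.sqrt(dot(w, w))
    return [hi + alpha * wi / norm for hi, wi in zip(h, w)]


# Synthetic stand-in for hidden states (assumption: real activations
# would come from a model layer, which is not shown here).
rng = random.Random(0)
DIM = 8
true_dir = [1.0 if i == 0 else 0.0 for i in range(DIM)]  # hypothetical direction


def sample(label):
    shift = 1.5 if label == 1 else -1.5
    return [rng.gauss(0, 1) + shift * d for d in true_dir]


labels = [i % 2 for i in range(200)]
states = [sample(y) for y in labels]
w, b = train_probe(states, labels)

accuracy = sum(
    (probe_score(h, w, b) > 0.5) == (y == 1) for h, y in zip(states, labels)
) / len(states)

h = sample(1)
before = probe_score(h, w, b)
suppressed = probe_score(steer(h, w, alpha=-4.0), w, b)
amplified = probe_score(steer(h, w, alpha=4.0), w, b)
print(f"accuracy={accuracy:.2f} before={before:.3f} "
      f"suppressed={suppressed:.3f} amplified={amplified:.3f}")
```

In this toy setting, a negative `alpha` lowers the probe's score for the assumption and a positive `alpha` raises it. Whether such a shift changes downstream model behavior is the empirical question the paper addresses, and the sketch does not model it.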