Sycophancy, the tendency of large language models to favour user-affirming responses over critical engagement, has been identified as an alignment failure, particularly in high-stakes advisory and social contexts. While prior work has documented conversational features correlated with sycophancy, we still lack a systematic understanding of what provokes or prevents AI sycophancy. Here, we present a set of controlled experimental studies in which we first isolate how input framing influences sycophancy and then leverage these findings to develop mitigation strategies. In a nested factorial design, we compare questions with various non-question framings, varying three orthogonal factors: epistemic certainty (statement, belief, conviction), perspective (I- vs. user-perspective), and affirmation vs. negation. We show that (1) sycophancy is substantially higher in response to non-questions than to questions. Additionally, we find that (2) sycophancy increases monotonically with the epistemic certainty conveyed by the user, and (3) is amplified by I-perspective framing. Building on this, we show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy. Importantly, this effect is stronger than that of a simple baseline prompt asking models "not to be sycophantic". Our work offers a practical and effective input-level mitigation that both developers and users can easily adopt.