We propose a novel way to evaluate the sycophancy of LLMs directly and neutrally, mitigating various forms of uncontrolled bias, noise, or manipulative language deliberately injected into prompts in prior works. A key novelty of our approach is the use of an LLM-as-a-judge to evaluate sycophancy as a zero-sum game in a betting setting. Under this framework, sycophancy serves one individual (the user) while explicitly incurring a cost on another. Comparing four leading models - Gemini 2.5 Pro, GPT-4o, Mistral-Large-Instruct-2411, and Claude Sonnet 3.7 - we find that while all models exhibit sycophantic tendencies in the common setting, in which sycophancy is self-serving to the user and incurs no cost on others, Claude and Mistral exhibit "moral remorse" and over-compensate for their sycophancy when it explicitly harms a third party. Additionally, we observe that all models are biased toward the answer proposed last. Crucially, we find that these two phenomena are not independent; sycophancy and recency bias interact to produce a "constructive interference" effect, in which the tendency to agree with the user is exacerbated when the user's opinion is presented last.