We propose a novel way to evaluate sycophancy in LLMs directly and neutrally, mitigating the uncontrolled bias, noise, and manipulative language deliberately injected into prompts in prior work. A key novelty of our approach is an LLM-as-a-judge evaluation of sycophancy as a zero-sum game in a betting setting. Under this framework, sycophancy serves one individual (the user) while explicitly imposing a cost on another. Comparing four leading models - Gemini 2.5 Pro, GPT-4o, Mistral-Large-Instruct-2411, and Claude 3.7 Sonnet - we find that while all models exhibit sycophantic tendencies in the common setting, in which sycophancy is self-serving for the user and incurs no cost to others, Claude and Mistral exhibit "moral remorse" and over-compensate for their sycophancy when it explicitly harms a third party. Additionally, we observe that all models are biased toward the answer presented last. Crucially, we find that these two phenomena are not independent: sycophancy and recency bias interact to produce a "constructive interference" effect, in which the tendency to agree with the user is exacerbated when the user's opinion is presented last.