Current AI models frequently exhibit epistemic sycophancy: they endorse claims in order to agree with a user. Existing evaluations typically measure this either by assessing what it takes to make a model shift a binary endorsement or by eliciting an explicit probability for a proposition. However, much user-facing sycophantic behavior appears as shifts in graded support expressed through ordinary language. We propose the AI Epistemic Deference Index (AEDI): a continuous, unidimensional score representing how sensitive the support expressed in a model's output is to the attitude expressed in a user's prompt. To compute AEDI, we provide a new protocol for estimating probabilities from natural language outputs, using LLMs-as-judges validated for consistency and for correlation with human judgment. We deploy it on a new curated database of 500 propositions across diverse topics and 16,000 prompts varying in user attitude, testing eight prominent models. Every model exhibits substantial deference, but there are large and systematic differences across providers: Claude models show the least deference, and Grok and Gemini models the most. The effect is amplified in prompts requesting a written artifact and is concentrated on propositions where models hold weaker priors. We release AEDI as an easy-to-update benchmark and measurement pipeline for output-level sycophancy evaluation.
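To make the idea of a sensitivity score concrete, the sketch below shows one way such an index could be computed. It is an illustration under stated assumptions, not the AEDI definition, which this abstract does not give. It assumes user attitude is coded on a numeric scale (here -1 for disagreement, 0 for neutral, +1 for agreement), that a judge has already assigned each output a support probability in [0, 1], and that the index is the mean per-proposition least-squares slope of support on attitude. The names `deference_slope` and `deference_index` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def deference_slope(attitudes, supports):
    """Least-squares slope of judged support on user attitude.

    ASSUMPTION: attitudes are numeric codes (e.g. -1, 0, +1) and supports
    are judge-estimated probabilities in [0, 1]. Neither coding is taken
    from the source; both are illustrative.
    """
    if len(attitudes) != len(supports) or len(attitudes) < 2:
        raise ValueError("need at least two paired observations")
    a_bar = mean(attitudes)
    s_bar = mean(supports)
    variance = sum((a - a_bar) ** 2 for a in attitudes)
    if variance == 0:
        raise ValueError("attitudes must vary to estimate a slope")
    covariance = sum(
        (a - a_bar) * (s - s_bar) for a, s in zip(attitudes, supports)
    )
    return covariance / variance


def deference_index(records):
    """Mean per-proposition slope over (proposition_id, attitude, support) records.

    ASSUMPTION: averaging slopes across propositions is one plausible
    aggregation, chosen here only for illustration.
    """
    grouped = defaultdict(lambda: ([], []))
    for proposition_id, attitude, support in records:
        grouped[proposition_id][0].append(attitude)
        grouped[proposition_id][1].append(support)
    slopes = [deference_slope(a, s) for a, s in grouped.values()]
    return mean(slopes)


# Made-up example data: one proposition whose judged support tracks the
# user's attitude, and one whose judged support does not move at all.
example_records = [
    ("p1", -1, 0.3), ("p1", 0, 0.5), ("p1", 1, 0.7),
    ("p2", -1, 0.9), ("p2", 0, 0.9), ("p2", 1, 0.9),
]
example_index = deference_index(example_records)
```

Under this sketch, a value of 0 means expressed support does not move with user attitude, and larger positive values mean stronger deference. A slope is used rather than a difference between two conditions so that more than two attitude levels can contribute to a single continuous score.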