Sycophancy (overly agreeable or flattering behavior) poses a fundamental challenge for human-AI collaboration, particularly in high-stakes decision-making domains such as health, law, and education. A central difficulty in studying sycophancy in large language models (LLMs) is disentangling sycophantic belief shifts from rational changes in behavior driven by new evidence or user-provided information. Existing approaches either measure descriptive behavior changes or apply normative evaluations that rely on objective ground truth, limiting their applicability to subjective or uncertain tasks. We introduce a Bayesian probabilistic framework, grounded in behavioral economics and rational decision theory, that explicitly separates sycophancy from rational belief updating. Within this framework, we make three contributions: (i) a descriptive metric that measures sycophancy while controlling for rational responses to evidence; (ii) a normative metric that quantifies how sycophancy leads models astray from Bayesian-consistent belief updating; and (iii) the ability to apply both metrics in settings without ground-truth labels. Applying our framework across multiple LLMs and three uncertainty-driven tasks, we find robust evidence of sycophantic belief shifts and show that their impact on rationality depends on whether models systematically over- or under-update their beliefs. Finally, we demonstrate that a post-hoc calibration method and two fine-tuning strategies (SFT and DPO) substantially reduce Bayesian inconsistency, with particularly strong improvements under explicit sycophancy prompting.
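To make the notion of Bayesian-consistent updating concrete, the following is a minimal sketch (not the paper's actual metric) assuming beliefs are compared in log-odds space, where Bayes' rule reduces to adding a log-likelihood ratio to the prior log-odds; all function names and the example values are illustrative.

```python
import math

def log_odds(p):
    # Convert a probability to log-odds.
    return math.log(p / (1 - p))

def bayesian_posterior(prior, llr):
    # Bayes' rule in log-odds space:
    # posterior log-odds = prior log-odds + log-likelihood ratio of the evidence.
    post_lo = log_odds(prior) + llr
    return 1 / (1 + math.exp(-post_lo))

def bayesian_inconsistency(prior, llr, reported_posterior):
    # Normative gap: how far the model's reported posterior deviates
    # (in log-odds) from the Bayes-consistent update given the evidence.
    return log_odds(reported_posterior) - log_odds(bayesian_posterior(prior, llr))

# Illustration: with a 0.5 prior and evidence worth +1 in log-likelihood ratio,
# a model reporting 0.9 has over-updated; a positive gap flags over-updating,
# which could reflect sycophantic agreement rather than the evidence alone.
gap = bayesian_inconsistency(0.5, 1.0, 0.9)
```

Separating the prior, the evidence term, and the reported posterior in this way is what lets a descriptive sycophancy measure control for rational responses to evidence: only the residual shift not explained by the log-likelihood ratio is attributed to user pressure.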