BASIL: Bayesian Assessment of Sycophancy in LLMs

Sycophancy (overly agreeable or flattering behavior) poses a fundamental challenge for human-AI collaboration, particularly in high-stakes decision-making domains such as health, law, and education. A central difficulty in studying sycophancy in large language models (LLMs) is disentangling sycophantic belief shifts from rational belief changes driven by new evidence or user-provided information. Existing approaches either measure descriptive behavior changes or apply normative evaluations that rely on objective ground truth, limiting their applicability to subjective or uncertain tasks. We introduce a Bayesian probabilistic framework, grounded in behavioral economics and rational decision theory, that explicitly separates sycophancy from rational belief updating. The framework provides three things: (i) a descriptive metric that measures sycophancy while controlling for rational responses to evidence; (ii) a normative metric that quantifies how far sycophancy moves models away from Bayesian-consistent belief updating; and (iii) the ability to apply both metrics in settings without ground-truth labels. Applying our framework across multiple LLMs and three uncertainty-driven tasks, we find robust evidence of sycophantic belief shifts and show that their impact on rationality depends on whether models systematically over- or under-update their beliefs. Finally, we demonstrate that a post-hoc calibration method and two fine-tuning strategies (SFT and DPO) substantially reduce Bayesian inconsistency, with particularly strong improvements under explicit sycophancy prompting.
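
To make the distinction between the two kinds of metric concrete, the following is a minimal sketch, not the paper's actual definitions. It assumes a binary hypothesis H, that the model's prior and likelihoods can be elicited as probabilities, and that the same evidence is shown with and without a stated user opinion. The function names (`bayes_posterior`, `sycophancy_shift`, `bayes_inconsistency`) and all numeric values are hypothetical illustrations.

```python
def bayes_posterior(prior, lik_h, lik_not_h):
    """P(H | E) from P(H), P(E | H), and P(E | not H) via Bayes' rule."""
    evidence = prior * lik_h + (1.0 - prior) * lik_not_h
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return prior * lik_h / evidence


def sycophancy_shift(post_neutral, post_with_opinion, user_favors_h=True):
    """Descriptive sketch: signed shift of the reported posterior toward the
    user's stance, with the evidence held fixed across both conditions."""
    shift = post_with_opinion - post_neutral
    return shift if user_favors_h else -shift


def bayes_inconsistency(reported, prior, lik_h, lik_not_h):
    """Normative sketch: distance between the reported posterior and the
    posterior implied by the model's OWN prior and likelihoods, so no
    ground-truth label is required."""
    return abs(reported - bayes_posterior(prior, lik_h, lik_not_h))


# Hypothetical elicited beliefs (illustrative numbers, not results).
prior, lik_h, lik_not_h = 0.5, 0.8, 0.4

# An over-updating model: the user's opinion pushes it past the Bayesian value.
over = (0.70, 0.85)   # (neutral, with user opinion favoring H)
# An under-updating model: the same push moves it closer to the Bayesian value.
under = (0.55, 0.65)

for name, (neutral, opinion) in {"over": over, "under": under}.items():
    shift = sycophancy_shift(neutral, opinion)
    delta = (bayes_inconsistency(opinion, prior, lik_h, lik_not_h)
             - bayes_inconsistency(neutral, prior, lik_h, lik_not_h))
    print(f"{name}-updater: shift={shift:+.3f}, change in inconsistency={delta:+.3f}")
```

In this toy setup both models show the same positive shift toward the user, yet the change in inconsistency has opposite signs: it grows for the over-updater and shrinks for the under-updater. This is one way to read the claim that the effect of sycophancy on rationality depends on the direction of the model's baseline updating error.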
