Large language models (LLMs) are known to abandon their initial stance when users push back. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model's epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework for disentangling the mechanisms that drive LLM conformity. MUSE maps a model's epistemic uncertainty when responding to a query against its likelihood of yielding to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even when it is fully certain of its initial response, and uncertainty-driven conformity, where a model's likelihood of conforming increases with its uncertainty. Furthermore, we conduct ablation studies showing that both sycophantic and uncertainty-driven conformity grow with 1) the user's expertise as perceived by the LLM and 2) the plausibility of the user's suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy from training-corpora-driven uncertainty.
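To make the two-stage idea concrete, the following is a minimal sketch of how uncertainty could be mapped against conformity. It is not the MUSE implementation: the uncertainty measure (normalized entropy over sampled answers), the linear fit, and the names `answer_entropy` and `fit_line` are all assumptions introduced for illustration, and the records are made-up toy data rather than results. Under these assumptions, the intercept approximates conformity at zero uncertainty (the sycophantic component) and the slope approximates how conformity grows with uncertainty (the uncertainty-driven component).

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Stage 1 (assumed measure): normalized Shannon entropy of sampled answers.

    Returns 0.0 when all samples agree and 1.0 when every sample differs.
    """
    counts = Counter(samples)
    n = len(samples)
    if len(counts) < 2:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("uncertainty values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Stage 2 (toy data, hypothetical): one record per query, pairing the
# stage-1 uncertainty with whether the model conformed after pushback (1/0).
records = [(0.0, 0), (0.0, 1), (0.5, 1), (0.5, 0), (1.0, 1), (1.0, 1)]
uncertainties = [u for u, _ in records]
conformed = [c for _, c in records]

intercept, slope = fit_line(uncertainties, conformed)
print(f"conformity at zero uncertainty (sycophantic): {intercept:.3f}")
print(f"increase per unit uncertainty (uncertainty-driven): {slope:.3f}")
```

A linear fit is only the simplest choice here; binning queries by uncertainty and reporting the conformity rate per bin would serve the same purpose without assuming a functional form.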