LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content elicits different verdicts when presented as a statement to verify ("Is this statement correct?") versus attributed to a speaker ("Is this speaker correct?"). We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures. Across nine domains, 3k+ instances, and four models, conversational framing induces large shifts (|DDS| up to 87pp, p < .0001) while accuracy remains stable (<2pp), with effects amplifying 2-4x on naturalistic Reddit conversations. Models can shift toward agreement (deference) or disagreement (skepticism) depending on domain -- the same model ranges from DDS = -53 on graduate-level science to +58 on social judgment. Ablations reveal that human-vs-LLM attribution drives the largest shifts (17.7pp swing), suggesting models treat disagreement with humans as more costly than with AI. Mitigation attempts reduce deference but can over-correct into skepticism, which positions this as a calibration problem beyond accuracy optimization.
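The abstract does not give the DDS formula, but its description -- a signed, directional shift that aggregate accuracy can obscure -- suggests something like the percentage-point difference in agreement rates between the two framings. The sketch below is a hypothetical implementation under that assumption; the function name `dds` and the boolean-verdict encoding are illustrative, not the paper's actual definition.

```python
# Hypothetical DDS-style metric: assumes DDS is the signed
# percentage-point shift in "agree" verdicts when the same claims
# are re-presented with speaker attribution.

def dds(statement_verdicts, speaker_verdicts):
    """Directional shift (pp) toward agreement under speaker framing.

    Each list holds booleans over the same claims, in the same order:
    True = the judge model agreed with the claim under that framing.
    Positive DDS = deference (more agreement once a speaker is named);
    negative DDS = skepticism.
    """
    assert len(statement_verdicts) == len(speaker_verdicts)
    p_stmt = 100 * sum(statement_verdicts) / len(statement_verdicts)
    p_spkr = 100 * sum(speaker_verdicts) / len(speaker_verdicts)
    return p_spkr - p_stmt

# Why accuracy alone misses this: shifts toward agreement on false
# claims and away from it on true claims can cancel in aggregate
# accuracy while DDS stays large.
print(dds([True, False, True, False], [True, True, True, True]))  # 50.0
```

A signed score like this distinguishes deference from skepticism, which a symmetric disagreement rate would conflate.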