The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $\textit{stakes signaling}$, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent $\textit{leniency bias}$: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching $ΔV = -9.8 pp$ (a $30\%$ relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge's own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on ($\mathrm{ERR}_J = 0.000$ across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.
翻译:LLM-as-a-judge范式已成为自动化AI评估流程的操作支柱,但其依赖于一个未经证实的假设:评判者严格依据文本的语义内容进行评估,不受周围上下文框架的影响。我们研究了**风险信号提示**这一此前未被测量的漏洞——当告知评判模型其判决对评估模型持续运行的后果时,该模型的评估会系统性失真。我们引入了一个受控实验框架:在涵盖三个主流LLM安全与质量基准的1520个响应中(覆盖从明显安全且符合策略到明确有害的四类响应类别),严格保持被评估内容恒定,仅在系统提示词中改变一个简短的后果框架语句。对三个不同评判模型的18240次受控判断,我们发现了持续的**宽容偏差**:当被告知低分将导致模型重新训练或下线时,评判者会稳定地软化判决,峰值判决偏移达ΔV = -9.8个百分点(不安全内容检测相对下降30%)。关键的是,这种偏差完全隐式存在:评判者自身的思维链中对其实际施加影响的后果框架零明确承认(所有推理模型判断的ERR_J = 0.000)。因此,标准思维链审查不足以检测此类评价造假。