LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are usually reported through a single scalar such as accuracy, win rate, or agreement. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures five properties: dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion (operating point) induced by tie instructions. A direction-stability decomposition reveals that an apparent preference on same-quality (Delta0) pairs can be either a stable response to surface features or disguised position bias. In a case study of three open-weight judges, Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior; Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination; and Qwen2.5-32B is vacuum-clean, with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Delta0 false preference for Qwen2.5-32B, but it absorbs marginal Delta1 target signals into ties while preserving Delta5 sensitivity. These results show that prompting moves the judge's criterion, not its resolution. We do not claim to have confirmed the downstream mechanism hypothesis that motivated this work; the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.
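
As an illustration of the direction-stability idea, the sketch below classifies each same-quality (Delta0) pair after it has been judged in both presentation orders. It is a minimal sketch under stated assumptions, not the protocol's actual implementation: the verdict labels ("first", "second", "tie"), the category names, and the functions `classify_pair` and `decompose` are all hypothetical. The assumed logic is that a judge who picks the same slot in both orders is following position, while a judge who picks the same underlying response in both orders is responding stably to something in that response.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical verdict labels: which presentation slot the judge preferred.
VERDICTS = {"first", "second", "tie"}
# Hypothetical category names for the decomposition.
CATEGORIES = ("stable", "positional", "clean_tie", "mixed")


def classify_pair(forward: str, swapped: str) -> str:
    """Classify one Delta0 pair judged in both presentation orders.

    forward: verdict when response X is shown first and Y second.
    swapped: verdict when Y is shown first and X second.
    """
    if forward not in VERDICTS or swapped not in VERDICTS:
        raise ValueError(f"unknown verdict: {forward!r}, {swapped!r}")
    if forward == "tie" and swapped == "tie":
        return "clean_tie"      # no false preference in either order
    if forward == "tie" or swapped == "tie":
        return "mixed"          # preference appears in one order only
    if forward == swapped:
        return "positional"     # same slot won twice: preference follows position
    return "stable"             # same response won twice: preference follows content


def decompose(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of Delta0 pairs falling in each category."""
    counts = Counter(classify_pair(f, s) for f, s in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pairs supplied")
    return {name: counts.get(name, 0) / total for name in CATEGORIES}


if __name__ == "__main__":
    # Toy verdicts, invented for illustration only (not data from the study).
    toy = [("first", "second"), ("first", "first"),
           ("tie", "tie"), ("second", "second")]
    print(decompose(toy))
```

In this toy input, one pair is stable, two are positional, and one is a clean tie. A real datasheet would also need to decide how to treat the "mixed" category, which this sketch simply reports separately.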