Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

Continuous evaluation of LLM products relies on a strong LLM judge treated as ground truth: a cheap monitor scores every interaction and a team is paged when the score drifts down. But the judge is itself a model behind an API, and a silent version bump or scoring-prompt update changes how it scores, so every drift alarm is ambiguous between a worse product and a changed judge. We resolve the ambiguity with three components: a fixed, human-labeled anchor set that the current judge re-scores at a steady interleave, a second betting e-process on the judge-versus-human gap, and a guard-window rule returning a verdict in {none, system, judge}. We prove anytime-validity, one-way identification (only the judge can move the anchors), an attribution race whose design law is that the anchors must out-run the main process they guard, and process orthogonality. On two real judge changes, a silent version bump is detected as judge drift in 60/60 runs with zero judge-to-system misattribution, and a contaminating strict-prompt change is correctly attributed in 110/120 runs at guard width 300, whereas the industry-default rolling z-test false-alarms on 75% of drift-free streams. Every experiment replicates on a second domain (TL;DR summarization) with nothing re-tuned, and where the domains differ, the differences are the ones the race predicts: the strict-prompt change shifts scores harder there, so the anchors fire faster and attribution becomes perfect (240/240). The monitor runs at approximately 0.64 times the cost of strong-judging every item, or 0.21 times in a cheaper-but-deafer regime.
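As a rough illustration of the two moving parts named above (a betting e-process and a guard-window verdict), the following is a minimal sketch and not the paper's implementation. It assumes observations rescaled to [0, 1], a fixed bet fraction, and the standard wealth update W_t = W_{t-1} * (1 + bet * (x_t - null_mean)), which is a nonnegative supermartingale under the null, so that by Ville's inequality an alarm at wealth >= 1/alpha has false-alarm probability at most alpha at any stopping time. The names `BettingEProcess` and `attribute`, the fixed-bet strategy, and the exact form of the guard rule are hypothetical choices made for this example; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BettingEProcess:
    """One-sided betting e-process for observations in [0, 1].

    Null hypothesis: each observation has conditional mean <= null_mean.
    ASSUMPTION: a fixed bet fraction; the paper may use an adaptive bet.
    """
    null_mean: float
    bet: float
    alpha: float = 0.05
    wealth: float = 1.0
    fired_at: Optional[int] = None  # stream time of the first alarm

    def __post_init__(self) -> None:
        if not 0.0 < self.null_mean < 1.0:
            raise ValueError("null_mean must lie strictly between 0 and 1")
        # bet <= 1 / null_mean keeps every wealth factor nonnegative.
        if not 0.0 <= self.bet <= 1.0 / self.null_mean:
            raise ValueError("bet must lie in [0, 1 / null_mean]")

    def update(self, x: float, t: int) -> bool:
        """Fold in observation x seen at stream time t; return True once fired."""
        self.wealth *= 1.0 + self.bet * (x - self.null_mean)
        if self.fired_at is None and self.wealth >= 1.0 / self.alpha:
            self.fired_at = t
        return self.fired_at is not None


def attribute(main_fire: Optional[int], anchor_fire: Optional[int],
              now: int, guard: int) -> str:
    """Hypothetical guard-window rule returning 'none', 'system', or 'judge'.

    ASSUMPTION: an anchor alarm that arrives no later than `guard` steps
    after the main alarm (or with no main alarm at all) blames the judge;
    a main alarm whose guard window closes without an anchor alarm blames
    the system; anything else is still 'none'.
    """
    if anchor_fire is not None and (main_fire is None
                                    or anchor_fire <= main_fire + guard):
        return "judge"
    if main_fire is not None and now >= main_fire + guard:
        return "system"
    return "none"
```

In this sketch the main process would be fed a drift signal derived from the monitor score and the anchor process the judge-versus-human gap on the anchor set, both indexed by a shared stream clock `t`. The guard width trades latency for attribution accuracy: a wider window gives the anchors more time to fire before a main alarm is blamed on the system.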
