Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $κ$, and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds that metric choice is entangled with the judgment scale and with the handling of ties, invalid outputs, and abstentions, and that these choices are rarely stated. For binary criteria (the common case in rubric-based evaluation, where each criterion is graded MET or UNMET), most of the reported numbers are redundant: Pearson's $r$, Spearman's $ρ$, Kendall's $τ_b$, the phi coefficient $φ$, and the Matthews correlation coefficient (MCC) all reduce to a single number on non-degenerate binary data, so reporting several of them only creates an illusion of corroborating evidence. Cohen's $κ$ is the one agreement coefficient that adds information: it shares $φ$'s numerator but normalizes differently, and the gap between them measures how far the judge's positive-label rate has drifted from the human's. We then trace what changes when a judge may abstain with a CANNOT_ASSESS verdict: the three common ways of handling abstentions are not interchangeable preprocessing choices but answer different questions, and they break the binary equivalences. The same equivalences reappear, up to a negligible finite-sample correction, for multi-judge ensembles scored with Fleiss' $κ$ or Krippendorff's $α$. We close with a reporting checklist that names the judgment scale, the abstention and tie handling mode, coverage, the confusion matrix, and the aggregation level alongside any scalar agreement coefficient.
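
The binary redundancy claim can be checked numerically. The following is a minimal sketch in pure Python, not the authors' code: the helper names (`expand`, `phi`, `pearson`, `kendall_tau_b`, `cohen_kappa`) and the confusion-matrix counts are hypothetical, chosen only for illustration. Here `a` counts items both raters mark MET, `b` items only the human marks MET, `c` items only the judge marks MET, and `d` items both mark UNMET.

```python
from math import sqrt

def expand(a, b, c, d):
    """Turn 2x2 counts into paired 0/1 label vectors (human, judge)."""
    human = [1] * a + [1] * b + [0] * c + [0] * d
    judge = [1] * a + [0] * b + [1] * c + [0] * d
    return human, judge

def phi(a, b, c, d):
    """Phi coefficient, identical to MCC for a 2x2 table."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def pearson(x, y):
    """Pearson's r computed from the raw label vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def kendall_tau_b(x, y):
    """Kendall's tau-b by brute-force pair counting (O(n^2))."""
    conc = disc = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for k in range(i + 1, n):
            dx, dy = x[i] - x[k], y[i] - y[k]
            if dx == 0 and dy == 0:
                continue                 # tied on both: drops out of tau-b
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    untied_x = conc + disc + tied_y_only  # pairs not tied on x
    untied_y = conc + disc + tied_x_only  # pairs not tied on y
    return (conc - disc) / sqrt(untied_x * untied_y)

def cohen_kappa(a, b, c, d):
    """Cohen's kappa from observed and chance-expected agreement."""
    n = a + b + c + d
    p_obs = (a + d) / n
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts: human positive rate 0.50, judge positive rate 0.60.
counts = (40, 10, 20, 30)
human, judge = expand(*counts)

print(f"phi / MCC : {phi(*counts):.3f}")               # 0.408
print(f"Pearson r : {pearson(human, judge):.3f}")      # 0.408
print(f"tau_b     : {kendall_tau_b(human, judge):.3f}")  # 0.408
print(f"kappa     : {cohen_kappa(*counts):.3f}")       # 0.400
```

With these counts the three correlation-style coefficients coincide, while $κ$ sits slightly below $φ$ because the two raters' positive-label rates differ. Setting the counts to `(40, 10, 10, 40)`, where both rates are 0.50, makes $κ$ and $φ$ equal at 0.6.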
