LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit and apply it to SummEval: $\textbf{(1)}$ a transitivity analysis revealing widespread per-input inconsistency that low aggregate violation rates mask ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores that provide theoretically guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width agrees consistently across judges ($\bar{r} = 0.32$-$0.38$), indicating that it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: the criterion matters more than the judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
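As a rough illustration of the two diagnostics named in the abstract, the following is a minimal sketch and not the released implementation. It assumes that per-document pairwise preferences are available as a set of `(winner, loser)` pairs, and that the nonconformity score is the absolute gap between the judge's score and the human score; both choices, along with the names `count_directed_3cycles`, `conformal_quantile`, and `prediction_set`, are hypothetical and may differ from what the paper actually uses.

```python
import math
from itertools import combinations

LIKERT = (1, 2, 3, 4, 5)  # the 1-5 label space from the abstract


def count_directed_3cycles(items, beats):
    """Count item triples whose pairwise preferences form a directed cycle.

    `beats` is a set of (winner, loser) pairs for ONE document
    (assumed input format, not taken from the paper).
    """
    count = 0
    for a, b, c in combinations(items, 3):
        forward = (a, b) in beats and (b, c) in beats and (c, a) in beats
        backward = (b, a) in beats and (c, b) in beats and (a, c) in beats
        if forward or backward:
            count += 1
    return count


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # order statistic to pick (1-indexed)
    if rank > n:
        return math.inf  # too few calibration points: every label must be kept
    return sorted(residuals)[rank - 1]


def prediction_set(judge_score, qhat):
    """All Likert labels within `qhat` of the judge's score."""
    return [y for y in LIKERT if abs(y - judge_score) <= qhat]


# Toy calibration split (made-up numbers, for illustration only).
human = [4, 3, 5, 2, 4, 3, 1, 5, 4]
judge = [4, 4, 5, 4, 3, 3, 2, 5, 3]
residuals = [abs(h - j) for h, j in zip(human, judge)]  # assumed nonconformity score
qhat = conformal_quantile(residuals, alpha=0.25)
print(prediction_set(4, qhat))  # the set's length is the per-instance width
```

Under this sketch, a wider prediction set means the calibration residuals were too large to rule out many labels, which is the sense in which set width can act as a reliability signal. The coverage guarantee of split conformal prediction is marginal and relies on the calibration and test examples being exchangeable.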