The effectiveness of automatic evaluation of generative models is typically measured by comparing it to human evaluation using correlation metrics. However, metrics like Krippendorff's $\alpha$ and Randolph's $\kappa$, originally designed to measure the reliability of human labeling, make assumptions about human behavior and the labeling process. In this paper, we show how *relying on a single aggregate correlation score* can obscure fundamental differences between human behavior and automatic evaluation methods, including LLM-as-a-Judge. Specifically, we demonstrate that when the proportion of samples with variation or uncertainty in human labels (gathered during human evaluation) is relatively high, machine labels (generated by automatic evaluation methods) may superficially appear to have similar or better correlation with the human majority label than human-to-human (HH) correlation. This can create the misleading impression that automatic evaluation is accurate enough to approximate the human majority label. However, as the proportion of samples with consistent human labels increases, the correlation between machine labels and human majority labels declines, falling below HH correlation. Based on these findings, we first propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance. Second, recognizing that uncertainty and variation are inherent in perception-based human evaluations, such as those involving attitudes or preferences, we introduce a new metric for such scenarios, *binned Jensen-Shannon Divergence for perception*, to better measure the effectiveness of automatic evaluations. Third, we present visualization techniques, *perception charts*, to compare the strengths and limitations of automatic evaluation and to contextualize correlation measures appropriately.
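The abstract does not spell out the exact construction of the binned JSD metric; the following is an illustrative sketch only, assuming samples are binned by human-annotator agreement (fraction of annotators choosing the majority label) and that the pooled human label distribution is compared with the machine label distribution within each bin. The function names (`js_divergence`, `binned_jsd`) and the agreement-based binning scheme are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between two
    discrete distributions given as count or probability vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))  # KL divergence, base 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def binned_jsd(human_labels, machine_labels, n_classes, n_bins=3):
    """Bin samples by human-label agreement, then compare the pooled human
    and machine label distributions within each bin via JSD.

    human_labels:   list of per-sample annotator label lists, e.g. [[1, 1, 2], ...]
    machine_labels: one machine label per sample
    Returns: {(bin_lo, bin_hi): jsd_value} for each non-empty bin.
    """
    # Per-sample agreement = fraction of annotators picking the majority label.
    agreement = np.array([np.bincount(h, minlength=n_classes).max() / len(h)
                          for h in human_labels])
    edges = np.linspace(agreement.min(), agreement.max() + 1e-9, n_bins + 1)
    out = {}
    for b in range(n_bins):
        idx = [i for i, a in enumerate(agreement) if edges[b] <= a < edges[b + 1]]
        if not idx:
            continue
        h_counts = np.zeros(n_classes)
        m_counts = np.zeros(n_classes)
        for i in idx:
            h_counts += np.bincount(human_labels[i], minlength=n_classes)
            m_counts[machine_labels[i]] += 1
        out[(edges[b], edges[b + 1])] = js_divergence(h_counts, m_counts)
    return out
```

A low per-bin JSD means the machine's label distribution tracks the pooled human distribution in that agreement stratum; reporting it per bin, rather than as one aggregate score, surfaces the divergence pattern the abstract describes (machine labels looking adequate on high-uncertainty samples but falling behind on high-agreement ones).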