Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.
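The abstract does not spell out STABLEVAL's model, so the following is only a minimal sketch of the general idea it describes: replace a hard majority vote with the posterior probability that an item is correct, estimated jointly with per-annotator confusion rates. It assumes binary labels and a Dawid-Skene-style latent-class model fitted by EM, which is a stand-in rather than the authors' method. All function names (`majority_credit`, `posterior_credit`, `agent_scores`), the smoothing pseudo-count, and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def majority_credit(votes):
    """Hard credit per item: 1.0 if strictly more than half the labels are 1."""
    return [1.0 if 2 * sum(label for _, label in item) > len(item) else 0.0
            for item in votes]


def posterior_credit(votes, n_annotators, n_iter=50, smooth=1.0):
    """Soft credit per item: P(item is correct | labels) under an assumed
    binary Dawid-Skene-style model fitted by EM.

    votes: one non-empty list per item of (annotator_id, label) pairs,
           with label in {0, 1}.
    smooth: pseudo-count keeping every estimated rate strictly inside (0, 1).
    """
    # Initialise the posterior with each item's fraction of positive labels.
    post = [sum(label for _, label in item) / len(item) for item in votes]
    sens = spec = None
    for _ in range(max(1, n_iter)):
        # M-step: class prior plus per-annotator sensitivity and specificity.
        prior = (sum(post) + smooth) / (len(post) + 2 * smooth)
        tp = [smooth] * n_annotators
        pos = [2 * smooth] * n_annotators
        tn = [smooth] * n_annotators
        neg = [2 * smooth] * n_annotators
        for q, item in zip(post, votes):
            for a, label in item:
                pos[a] += q
                neg[a] += 1.0 - q
                if label == 1:
                    tp[a] += q
                else:
                    tn[a] += 1.0 - q
        sens = [t / p for t, p in zip(tp, pos)]
        spec = [t / n for t, n in zip(tn, neg)]
        # E-step: log-odds that each item is correct given its labels.
        post = []
        for item in votes:
            log_odds = math.log(prior) - math.log(1.0 - prior)
            for a, label in item:
                if label == 1:
                    log_odds += math.log(sens[a]) - math.log(1.0 - spec[a])
                else:
                    log_odds += math.log(1.0 - sens[a]) - math.log(spec[a])
            post.append(_sigmoid(log_odds))
    return post, sens, spec


def agent_scores(credits, item_agent):
    """Average item credit per agent; item_agent[i] names the agent behind item i."""
    totals, counts = defaultdict(float), defaultdict(int)
    for credit, agent in zip(credits, item_agent):
        totals[agent] += credit
        counts[agent] += 1
    return {agent: totals[agent] / counts[agent] for agent in totals}


# Toy data: annotators 0 and 1 agree on every item, annotator 2 always dissents.
votes = [
    [(0, 1), (1, 1), (2, 0)],
    [(0, 1), (1, 1), (2, 0)],
    [(0, 0), (1, 0), (2, 1)],
    [(0, 0), (1, 0), (2, 1)],
]
owners = ["agent_a", "agent_a", "agent_b", "agent_b"]
post, sens, spec = posterior_credit(votes, n_annotators=3)
print("majority :", agent_scores(majority_credit(votes), owners))
print("posterior:", agent_scores(post, owners))
```

In this sketch the soft credits stay strictly between 0 and 1, so an agent's score reflects how contested its items were, and the estimated sensitivity of the dissenting annotator falls below 0.5. Re-running both aggregations on different annotator subsets and comparing the resulting agent orderings would be one way to probe the ranking stability the abstract refers to.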