Two methodologies dominate current benchmarking practice: rubric-based scoring, which evaluates items against predefined criteria, and comparative judgment, which elicits pairwise preferences between outputs. Although both are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from experienced practicing attorneys, including attorneys at major U.S. law firms. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality levels, we provide an initial empirical comparison. Comparative judgments recover the intended quality ordering substantially better than rubrics under both a per-task rank-correlation metric (mean Spearman's rank correlation of 0.908 vs. 0.150, estimated difference = 0.758 [0.494, 1.021]) and a per-judgment pairwise win-rate metric (0.669 vs. 0.542, estimated difference = 0.127 [0.067, 0.186]), while requiring less than half the annotation time. These patterns hold for both human annotators and LLM autograders. Beyond this initial comparison, the paired structure of the dataset supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth.
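To make the two reported metrics concrete, the following minimal sketch shows how a per-task Spearman rank correlation and a per-judgment pairwise win rate could be computed against the intended quality levels. The data layout, function names, and toy values are hypothetical illustrations, not the JudgmentBench format or the authors' evaluation code, and the exact aggregation behind the reported numbers is not specified in the abstract.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def pairwise_win_rate(judgments):
    """Fraction of judgments that prefer the higher intended quality level.

    Each judgment is (level_a, level_b, winner), with winner in {"a", "b"}.
    Pairs with equal intended levels are skipped (an assumption of this sketch).
    """
    wins = [(w == "a") == (la > lb) for la, lb, w in judgments if la != lb]
    return sum(wins) / len(wins)


# Toy data for one hypothetical task: three outputs at intended levels 1 < 2 < 3.
intended_levels = [1, 2, 3]
rubric_scores = [4.0, 3.5, 4.5]  # hypothetical rubric totals per output
judgments = [(1, 2, "b"), (2, 3, "b"), (1, 3, "a"), (3, 1, "a")]

print(spearman(intended_levels, rubric_scores))
print(pairwise_win_rate(judgments))
```

In this sketch the rank correlation is computed once per task and would then be averaged across tasks, while the win rate is pooled over individual judgments, mirroring the per-task and per-judgment framing of the two metrics.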