Diagnosing the strengths and weaknesses of generative models requires moving beyond evaluations that collapse performance across heterogeneous prompts, toward fine-grained evaluation at the prompt level or within relatively homogeneous prompt subsets. Such fine-grained evaluations, however, face a data bottleneck: human gold-standard labels are too costly to collect at this scale, while automated ratings are often misaligned with human judgment. To resolve this challenge, we propose a novel statistical model based on tensor factorization that merges cheap autorater data with a limited set of human gold-standard labels. Specifically, our approach uses autorater scores to pretrain latent representations of prompts and generative models, and then aligns these pretrained representations to human preferences using a small calibration set. The resulting methodology is sample-efficient, robust to autorater quality, predicts human preferences on a per-prompt basis more accurately than standard baselines, and yields tight confidence intervals for key statistical parameters of interest. We also showcase the practical utility of our method by constructing granular leaderboards based on prompt qualities and by estimating model performance from autorater scores alone, eliminating the need for additional human annotations.
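To make the two-stage pipeline concrete, the following is a minimal Python sketch under strong simplifying assumptions: a CP-style tensor factorization fit by alternating least squares stands in for the pretraining step, and a ridge-regularized linear head stands in for the calibration step. All dimensions, the synthetic score tensor, and the calibration labels are hypothetical placeholders; this is not the paper's actual estimator, and the inference procedure behind the confidence intervals is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: P prompts, M models, A autoraters, K latent factors.
P, M, A, K = 50, 8, 3, 4

# Synthetic stand-in for the autorater score tensor S[prompt, model, autorater].
S = rng.normal(size=(P, M, A))

def khatri_rao(B, C):
    """Column-wise Kronecker product used in CP-ALS factor updates."""
    return np.einsum('ik,jk->ijk', B, C).reshape(-1, B.shape[1])

# Stage 1: pretrain latent factors with CP-style alternating least squares,
# so that S[p, m, a] ~ sum_k U[p, k] * V[m, k] * W[a, k].
U = rng.normal(size=(P, K))  # prompt representations
V = rng.normal(size=(M, K))  # model representations
W = rng.normal(size=(A, K))  # autorater representations
for _ in range(50):
    # Each update solves a least-squares problem on one unfolding of S.
    U = np.linalg.lstsq(khatri_rao(V, W), S.reshape(P, -1).T, rcond=None)[0].T
    V = np.linalg.lstsq(khatri_rao(U, W),
                        S.transpose(1, 0, 2).reshape(M, -1).T, rcond=None)[0].T
    W = np.linalg.lstsq(khatri_rao(U, V),
                        S.transpose(2, 0, 1).reshape(A, -1).T, rcond=None)[0].T

# Stage 2: align the frozen prompt/model factors to a small calibration set
# of human gold labels with a ridge-regularized linear head (a placeholder
# for the paper's calibration procedure).
n_cal = 100
cal_p = rng.integers(0, P, size=n_cal)
cal_m = rng.integers(0, M, size=n_cal)
y = rng.normal(size=n_cal)             # synthetic human preference scores

X = U[cal_p] * V[cal_m]                # interaction features per (prompt, model)
lam = 1e-2
beta = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

# Predicted human preference for every (prompt, model) pair:
# pred[p, m] = sum_k beta_k * U[p, k] * V[m, k].
pred = (U * beta) @ V.T
print(pred.shape)                      # (50, 8)
```

In this sketch, only the K-dimensional alignment map `beta` is learned from human labels while the autorater-pretrained factors stay frozen, which mirrors the sample-efficiency argument in the abstract: the expensive gold labels are spent on a low-dimensional calibration problem rather than on learning the representations themselves.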