Modern instruction-tuned models have become highly capable at text generation tasks such as summarization, and new models are expected to be released at a steady pace. In practice, one may wish to choose the best-performing summarization model for a new domain or purpose confidently but with minimal effort. In this work, we empirically investigate the test sample size necessary to select a preferred model in the context of news summarization. Empirical results reveal that comparative evaluation converges quickly for both automatic and human evaluation, with clear preferences for a system emerging from fewer than 100 examples. The human preference data allows us to quantify how well automatic scores can reproduce preference rankings across a variety of downstream summarization tasks. We find that, while automatic metrics are stable at smaller sample sizes, only some automatic metrics are able to moderately predict model win rates according to human preference.
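To make the notion of convergence concrete, the following is a minimal sketch of one way to check how quickly a pairwise preference stabilizes as the test sample grows. It is not the procedure used in this work: the subsampling scheme, the function names `win_rate` and `agreement_with_full_set`, and the synthetic preference data (1000 judgments with a 65% win rate for system A) are all assumptions made for illustration.

```python
import random

def win_rate(prefs):
    """Fraction of pairwise judgments won by system A (1 = A preferred, 0 = B preferred)."""
    return sum(prefs) / len(prefs)

def agreement_with_full_set(prefs, n, trials=2000, seed=0):
    """Share of random size-n subsamples that pick the same winner as the full set.

    A subsample with a win rate of exactly 0.5 is counted as preferring system B.
    """
    rng = random.Random(seed)
    full_winner = win_rate(prefs) > 0.5
    hits = 0
    for _ in range(trials):
        sample = rng.sample(prefs, n)  # draw n judgments without replacement
        if (win_rate(sample) > 0.5) == full_winner:
            hits += 1
    return hits / trials

# Hypothetical data, not taken from the paper: system A wins 650 of 1000 judgments.
prefs = [1] * 650 + [0] * 350

for n in (10, 25, 50, 100):
    print(n, agreement_with_full_set(prefs, n))
```

Under these assumptions, the printed agreement rises with `n`. How fast it rises depends on the gap between the two systems, so the sample size needed for real data may differ from what this synthetic example suggests.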