Recent advances in summary evaluation rely on model-based metrics to assess quality dimensions such as completeness, conciseness, and faithfulness. However, these methods often require large language models, and their predicted scores are frequently miscalibrated, which limits their reliability. Moreover, evaluating the average quality of different summaries of a single document typically requires access to multiple reference summaries. Here, we propose a general framework that generates individual and average proxy scores without relying on reference summaries, human annotations, or expensive model-based metrics. We also propose group isotonic regression binning (GIRB), a calibration method that adjusts raw predictions to better align with ground-truth evaluation metrics. Although we focus on continuous-value scenarios such as summarization, the method is also applicable to discrete-value tasks such as question answering. Experiments on seven datasets demonstrate that our approach consistently outperforms existing baselines.
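To make the calibration idea concrete, the sketch below fits plain isotonic regression with the pool-adjacent-violators algorithm and uses the resulting blocks as bins. It is not GIRB itself: the abstract does not specify how groups are formed, so the grouping step is omitted. The function names `fit_isotonic` and `calibrate` and the toy scores are illustrative assumptions, not taken from the paper.

```python
from bisect import bisect_left

def fit_isotonic(raw_scores, targets):
    """Fit a non-decreasing step function from raw scores to targets.

    Generic pool-adjacent-violators sketch (an assumption, not the paper's GIRB).
    """
    pairs = sorted(zip(raw_scores, targets))
    # Each block stores [sum of targets, count, largest raw score in the block].
    blocks = []
    for x, y in pairs:
        blocks.append([y, 1, x])
        # Merge backwards while adjacent block means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right
    edges = [b[2] for b in blocks]          # right edge of each bin
    values = [b[0] / b[1] for b in blocks]  # calibrated value of each bin
    return edges, values

def calibrate(score, edges, values):
    """Return the calibrated value of the bin that contains `score`."""
    idx = bisect_left(edges, score)
    # Scores above the last edge fall back to the last bin.
    return values[min(idx, len(values) - 1)]

# Toy data (hypothetical): raw proxy scores and ground-truth metric values.
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
truth = [0.2, 0.1, 0.4, 0.3, 0.7, 0.9]
edges, values = fit_isotonic(raw, truth)
calibrated = [calibrate(s, edges, values) for s in raw]
```

Adjacent scores whose targets are out of order are pooled into one bin and share the bin mean, so the calibrated scores preserve the ranking of the raw predictions while moving closer to the ground-truth metric.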