Assessing the quality of summarizers poses significant challenges. In response, we propose a novel task-oriented evaluation approach that assesses summarizers by how useful their summaries are for downstream tasks, that is, by how well the summaries preserve task outcomes. We theoretically establish a direct relationship between the error probability of these tasks and the mutual information between source texts and generated summaries. We introduce $\texttt{COSMIC}$ as a practical implementation of this metric, and demonstrate that it correlates strongly with human judgment-based metrics and is effective at predicting downstream task performance. Comparative analyses against established metrics such as $\texttt{BERTScore}$ and $\texttt{ROUGE}$ show that $\texttt{COSMIC}$ performs competitively.
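To make the link between mutual information and task error concrete, the following is a minimal toy sketch. It is not the $\texttt{COSMIC}$ estimator and does not reproduce the bound derived in the paper: it uses the textbook (weak) Fano inequality, $P_e \geq (H(X) - I(X;Y) - 1)/\log_2 |\mathcal{X}|$, on a small discrete joint distribution, where $X$ stands in for a task outcome and $Y$ for whatever a summary reveals about it. The function names `mutual_information`, `entropy`, and `fano_lower_bound` are hypothetical and chosen only for illustration.

```python
import math


def mutual_information(joint):
    """Return I(X;Y) in bits for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (columns)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                           # 0 * log(0) is treated as 0
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


def entropy(dist):
    """Return the Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in dist if q > 0)


def fano_lower_bound(h_x, mi, n_classes):
    """Weak Fano bound: P_e >= (H(X) - I(X;Y) - 1) / log2(|X|), clipped at 0.

    Illustrative assumption: this is the standard textbook inequality,
    not the specific result established in the paper.
    """
    if n_classes < 2:
        raise ValueError("need at least two classes")
    return max(0.0, (h_x - mi - 1.0) / math.log2(n_classes))


# Toy case 1: the "summary" Y is independent of the task outcome X.
independent = [[1 / 16] * 4 for _ in range(4)]
# Toy case 2: the "summary" Y determines the task outcome X exactly.
perfect = [[0.25 if i == j else 0.0 for j in range(4)] for i in range(4)]

h_x = entropy([0.25] * 4)  # uniform over 4 outcomes: 2 bits

bound_independent = fano_lower_bound(h_x, mutual_information(independent), 4)
bound_perfect = fano_lower_bound(h_x, mutual_information(perfect), 4)
```

In this toy setting, the uninformative summary carries 0 bits about the outcome and the bound forces an error probability of at least 0.5, while the fully informative summary carries 2 bits and the bound drops to 0. The direction of the effect, more shared information permitting lower task error, is the intuition behind scoring summarizers by mutual information.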