Assessing the quality of summarizers poses significant challenges. In response, we propose a novel task-oriented evaluation approach that assesses summarizers by their capacity to produce summaries that remain useful for downstream tasks while preserving task outcomes. We theoretically establish a direct relationship between the error probability of these tasks and the mutual information between source texts and generated summaries. We introduce $\texttt{COSMIC}$ as a practical implementation of this metric and demonstrate its strong correlation with human judgment-based metrics and its effectiveness in predicting downstream task performance. Comparative analyses against established metrics such as $\texttt{BERTScore}$ and $\texttt{ROUGE}$ highlight the competitive performance of $\texttt{COSMIC}$.
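To give some intuition for the link between mutual information and error probability, the following is a minimal toy sketch, not the $\texttt{COSMIC}$ estimator and not the bound established in the paper. It assumes a small discrete joint distribution over a source label and a summary label; the function names `mutual_information` and `bayes_error` and the two example distributions are hypothetical and chosen purely for illustration.

```python
import math


def mutual_information(joint):
    """Mutual information I(X; Y) in nats for a discrete joint distribution.

    joint[i][j] is P(X = i, Y = j); entries are non-negative and sum to 1.
    """
    px = [sum(row) for row in joint]        # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (columns)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


def bayes_error(joint):
    """Smallest achievable probability of error when guessing X from Y."""
    # For each observed Y = j, the best guess is the most likely X.
    return 1.0 - sum(max(col) for col in zip(*joint))


# Toy assumption: X is a binary property of the source, Y a binary
# property of the summary.
informative = [[0.5, 0.0], [0.0, 0.5]]        # Y determines X exactly
uninformative = [[0.25, 0.25], [0.25, 0.25]]  # Y is independent of X

for name, joint in [("informative", informative), ("uninformative", uninformative)]:
    print(name, mutual_information(joint), bayes_error(joint))
```

In this toy setting the informative case has high mutual information and zero error, while the independent case has zero mutual information and chance-level error, which is the qualitative behavior the theoretical result formalizes.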