Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks

Research on automated text summarization relies heavily on human and automatic evaluation. While recent work on human evaluation has mainly adopted intrinsic methods that judge the generic quality of summaries (e.g., informativeness and coherence), our work evaluates the usefulness of summaries with extrinsic methods. We carefully design three downstream tasks for extrinsic human evaluation of summaries: question answering, text classification, and text similarity assessment. We carry out experiments using system rankings and user behavior data to evaluate the performance of different summarization models. We find that summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering. The results show that summaries generated by fine-tuned models are more consistently useful across all three tasks, as the rankings of fine-tuned summarization systems are close across downstream tasks according to the proposed extrinsic metrics. Summaries generated by models in the zero-shot setting, however, are found to be biased towards the text classification and similarity assessment tasks, due to their general and less detailed style. We further evaluate the correlation of 14 intrinsic automatic metrics with human criteria and show that these metrics perform well in evaluating the usefulness of summaries in the question answering task but are less effective in the other two tasks. This highlights the limitations of relying solely on intrinsic automatic metrics when evaluating the performance and usefulness of summaries.
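To make the metric-versus-human comparison concrete, the following is a minimal sketch of a system-level rank correlation. It assumes Kendall's tau as the coefficient, which the abstract does not specify. The function name `kendall_tau`, the four systems, and all scores are hypothetical and chosen only for illustration; they are not results from this work.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical system-level scores for four summarization systems
# (made-up numbers, one entry per system, same order in every list).
metric_scores = [0.41, 0.38, 0.45, 0.30]              # an intrinsic automatic metric
qa_usefulness = [0.72, 0.65, 0.80, 0.50]              # extrinsic score, question answering
classification_usefulness = [0.60, 0.75, 0.58, 0.70]  # extrinsic score, classification

tau_qa = kendall_tau(metric_scores, qa_usefulness)
tau_cls = kendall_tau(metric_scores, classification_usefulness)
print(f"metric vs. QA usefulness:             {tau_qa:.3f}")
print(f"metric vs. classification usefulness: {tau_cls:.3f}")
```

In this toy setup the metric ranks systems exactly as the question answering scores do, but almost inversely to the classification scores, which is the kind of task-dependent disagreement the abstract describes.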
