An abundance of datasets exists for training and evaluating models on the task of summary generation. However, these datasets are often derived heuristically and lack sufficient annotations to support research into all aspects of summarization, such as evidence extraction and controllable summarization. We introduce a benchmark comprising 8 tasks that require multi-dimensional understanding of summarization, e.g., surfacing evidence for a summary, assessing its correctness, and gauging its relevance to different topics. We compare various methods on this benchmark and find that on multiple tasks, moderately sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics for creating training data and find that training on data created with these heuristics performs worse than training on $20\times$ less human-labeled data. Our benchmark consists of data from 6 different domains, allowing us to study the cross-domain performance of trained models. We find that for some tasks, the amount of training data matters more than the domain it comes from, while for other tasks, training specifically on data from the target domain, even if limited, is more beneficial. Our work fulfills the need for a well-annotated summarization benchmark with diverse tasks and provides useful insights about the impact of the quality, size, and domain of training data.