This paper presents the Long Context and Long Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k words average length), each of which comes with three summaries of different lengths (20%, 10%, and 5% of the input text), as well as approximately 15 question–answer (QA) pairs related to the input content. Notably, LCFO also provides alignments between specific QA pairs and the corresponding summaries in 7 domains. The primary motivation behind providing summaries of different lengths is to establish a controllable framework for generating long texts from shorter inputs, i.e., summary expansion. To establish an evaluation metric framework for summarization and summary expansion, we provide human evaluation scores for human-generated outputs, as well as results from various state-of-the-art large language models (LLMs). GPT-4o-mini achieves the best human scores among automatic systems in both the summarization and summary expansion tasks (approximately +10% and +20%, respectively). It even surpasses human output quality in the case of short summaries (approximately +7%). Overall, automatic metrics achieve low correlations with human evaluation scores (approximately 0.4), but moderate correlations on specific evaluation aspects such as fluency and attribution (approximately 0.6). The LCFO benchmark offers a standardized platform for evaluating summarization and summary expansion performance, as well as the corresponding automatic metrics, thereby providing an important evaluation framework to advance generative AI.