While large language models (LLMs) already achieve strong performance on standard generic summarization benchmarks, their performance in more complex summarization settings is less studied. We therefore benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement describing the desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluations of five LLM-based systems to assess their instruction-following capabilities in controllable summarization. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) no LLM-based evaluation method achieves strong alignment with human annotators when judging the quality of candidate summaries; and (3) different LLMs show large performance gaps in both summary generation and evaluation. We make our collected benchmark, InstruSum, publicly available to facilitate future research in this direction.