Automatically summarizing large text collections is a valuable tool for document research, with applications in journalism, academic research, legal work, and many other fields. In this work, we contrast two classes of systems for large-scale multi-document summarization (MDS): compression and full-text. Compression-based methods use a multi-stage pipeline and often produce lossy summaries, whereas full-text methods promise a lossless summary by relying on recent advances in long-context reasoning. To understand their utility for large-scale MDS, we evaluate them on three datasets, each containing approximately one hundred documents per summary. Our experiments cover a diverse set of long-context transformers (Llama-3.1, Command-R, Jamba-1.5-Mini) and compression methods (retrieval-augmented, hierarchical, incremental). Overall, we find that full-text and retrieval methods perform best in most settings. Through further analysis of salient-information retention patterns, we show that compression-based methods hold strong promise at intermediate stages, even outperforming full-text methods; however, they suffer information loss due to their multi-stage pipelines and lack of global context. Our results highlight the need for hybrid approaches that combine compression and full-text processing to achieve optimal performance on large-scale multi-document summarization.