Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical systematic reviews of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: even the best-performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input composition (e.g., the ratio of positive to negative reviews). We propose a simple, general, and effective method for improving model synthesis capabilities: generating an explicitly diverse set of candidate outputs, then selecting the candidate best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate.
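The candidate-selection-with-abstention step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `estimate_polarity`, `select_summary`, and the `TOLERANCE` threshold are all hypothetical stand-ins, and a real system would use a trained classifier to estimate each candidate's implied aggregate rather than a word-count heuristic.

```python
# Hedged sketch: pick, from a diverse candidate pool, the summary whose
# estimated aggregate (here, sentiment polarity) best matches the expected
# aggregate of the inputs; abstain if no candidate is close enough.
# All names below are illustrative assumptions, not the authors' code.

from typing import Optional

TOLERANCE = 0.25  # abstain if no candidate's polarity is this close to the target


def estimate_polarity(text: str) -> float:
    """Toy aggregate estimator: fraction of positive cue words among
    sentiment cue words. A real system would use a trained classifier."""
    pos = {"great", "good", "excellent", "positive"}
    neg = {"bad", "poor", "terrible", "negative"}
    words = [w.strip(".,:;!?") for w in text.lower().split()]
    p = sum(w in pos for w in words)
    n = sum(w in neg for w in words)
    return 0.5 if p + n == 0 else p / (p + n)


def select_summary(candidates: list[str], target: float) -> Optional[str]:
    """Return the candidate whose estimated polarity is nearest to the
    expected aggregate `target`; return None (abstain) if none is close."""
    best = min(candidates, key=lambda c: abs(estimate_polarity(c) - target))
    if abs(estimate_polarity(best) - target) > TOLERANCE:
        return None  # abstain: no good candidate
    return best


candidates = [
    "Critics found the film great and the acting excellent.",
    "Reviews were mixed: some good moments but a poor script.",
    "A terrible, bad film according to most reviews.",
]
# Suppose 70% of the input reviews were positive: the mixed summary is closest.
print(select_summary(candidates, target=0.7))
```

The abstention branch is what distinguishes this from plain reranking: when even the nearest candidate's implied aggregate falls outside the tolerance, the method declines to output a summary rather than emit one that misrepresents the input composition.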