The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model with a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systematically generate semantically-oriented questions from a salient sentence in one document and challenge the model, during pre-training, to answer these questions while "peeking" into other topically-related documents. In a similar manner, the model is also challenged to recover the sentence from which the question was generated, again while leveraging cross-document information. This novel multi-document QA formulation directs the model to better recover cross-text informational relations, and introduces a natural augmentation that artificially increases the pre-training data. Further, unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation (e.g., QA) and long text generation (e.g., summarization). Following this scheme, we pre-train our model -- termed QAmden -- and evaluate its performance across several multi-document tasks, including multi-document QA, summarization, and query-focused summarization. QAmden yields improvements of up to 7% on these tasks and significantly outperforms zero-shot GPT-3.5 and GPT-4.
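To make the objective concrete, the following is a minimal sketch of how one pre-training instance might be assembled from a document cluster. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the input format, so the `<mask>`, `<doc-sep>`, and `<sep>` tokens, the `build_instance` helper, and the choice to hide the salient sentence by masking are all hypothetical. The question and answer are assumed to come from an external question-generation step.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical special tokens; the abstract does not name the real ones.
MASK = "<mask>"
DOC_SEP = "<doc-sep>"
ANSWER_SEP = "<sep>"


@dataclass
class PretrainInstance:
    source: str  # question followed by the cluster, with the salient sentence hidden
    target: str  # answer followed by the hidden salient sentence


def build_instance(
    cluster: List[List[str]],  # one list of sentences per document
    doc_idx: int,              # document containing the salient sentence
    sent_idx: int,             # position of the salient sentence in that document
    question: str,             # assumed to be generated from the salient sentence
    answer: str,               # assumed answer span for that question
) -> PretrainInstance:
    salient = cluster[doc_idx][sent_idx]

    # Replace the salient sentence with a mask (an assumption), so the model
    # has to rely on the other documents in the cluster.
    masked_docs = []
    for d, sentences in enumerate(cluster):
        kept = [
            MASK if (d == doc_idx and s == sent_idx) else sentence
            for s, sentence in enumerate(sentences)
        ]
        masked_docs.append(" ".join(kept))

    source = f"{question} {DOC_SEP} " + f" {DOC_SEP} ".join(masked_docs)
    # The target covers both goals described above: answer the question
    # and recover the sentence the question was generated from.
    target = f"{answer} {ANSWER_SEP} {salient}"
    return PretrainInstance(source=source, target=target)


cluster = [
    ["A storm hit the coast on Monday.", "Thousands lost power."],
    ["The storm made landfall Monday.", "Crews restored power by Friday."],
]
inst = build_instance(cluster, 0, 0, "When did the storm hit the coast?", "on Monday")
print(inst.source)
print(inst.target)
```

Because several questions can be generated from the same salient sentence, calling such a helper once per question would produce several instances from one cluster, which is one way to read the "natural augmentation" mentioned above.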