Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering

The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model with a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systematically generate semantically-oriented questions from a salient sentence in one document and challenge the model, during pre-training, to answer these questions while "peeking" into other topically-related documents. In a similar manner, the model is also challenged to recover the sentence from which the question was generated, again while leveraging cross-document information. This novel multi-document QA formulation directs the model to better recover cross-text informational relations, and introduces a natural augmentation that artificially increases the pre-training data. Further, unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation (e.g., QA) and long text generation (e.g., summarization). Following this scheme, we pre-train our model, termed QAmden, and evaluate its performance across several multi-document tasks, including multi-document QA, summarization, and query-focused summarization, yielding improvements of up to 7% and significantly outperforming zero-shot GPT-3.5 and GPT-4.
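To make the objective above concrete, the following is a minimal sketch of how a single pre-training instance might be assembled from a document cluster. It is an illustration under stated assumptions, not the QAmden implementation: the special tokens (`<doc-sep>`, `<mask>`, `<q>`), the `QAExample` type, the `build_instance` function, the choice to hide the salient sentence behind a mask, and the target layout (answer followed by the sentence) are all hypothetical. Question generation is assumed to have happened upstream.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical special tokens; the abstract does not specify the real ones.
DOC_SEP = "<doc-sep>"
MASK = "<mask>"
Q_START = "<q>"


@dataclass
class QAExample:
    """One question generated from a salient sentence (hypothetical type)."""
    doc_index: int       # document that contains the salient sentence
    sentence_index: int  # position of that sentence inside the document
    question: str
    answer: str


def build_instance(cluster: List[List[str]], ex: QAExample) -> Tuple[str, str]:
    """Return (model_input, target) for one cross-document QA instance.

    `cluster` is a list of documents, each a list of sentences.
    Assumption: the salient sentence is replaced by a mask token, so the
    model has to rely on the other documents in the cluster.
    """
    docs = []
    for d, sentences in enumerate(cluster):
        sents = list(sentences)  # copy, so the caller's cluster is untouched
        if d == ex.doc_index:
            sents[ex.sentence_index] = MASK
        docs.append(" ".join(sents))

    # Input: all documents joined by a separator, then the question.
    source = f" {DOC_SEP} ".join(docs) + f" {Q_START} {ex.question}"

    # Target: the short answer, then the sentence the question came from.
    salient = cluster[ex.doc_index][ex.sentence_index]
    target = f"{ex.answer} {DOC_SEP} {salient}"
    return source, target
```

Under these assumptions, generating several questions from the same salient sentence yields several distinct instances over the same cluster, which is one way to read the "natural augmentation" mentioned above.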

