Large language models achieve impressive performance on many natural language processing tasks. However, their knowledge is limited to what appears in the pretraining corpus. Retrieval augmentation addresses this limitation by retrieving context from external knowledge sources to complement the language model. Existing retrieval augmentation techniques, however, ignore the structural relationships between the retrieved documents. Furthermore, retrieval models remain underexplored in scientific tasks, especially with regard to the faithfulness of retrieved documents. In this paper, we propose a novel structure-aware retrieval-augmented language model that accommodates document structure during retrieval augmentation. We create a heterogeneous document graph capturing multiple types of relationships (e.g., citation, co-authorship) that connect documents from more than 15 scientific disciplines (e.g., Physics, Medicine, Chemistry). We train a graph neural network on the curated document graph to act as a structural encoder for the passages retrieved during model pretraining. In particular, along with text embeddings of the retrieved passages, we obtain structural embeddings of the corresponding documents (passages) and fuse the two before feeding them to the language model. We evaluate our model extensively on scientific benchmarks that include science question-answering and scientific document classification tasks. Experimental results demonstrate that structure-aware retrieval yields more coherent, faithful, and contextually relevant passages, while achieving comparable overall accuracy.
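To make the fusion step concrete, the following is a minimal sketch of how a structural embedding might be derived from a document graph and combined with a passage's text embedding. It is an illustration under stated assumptions, not the model described above: the function names `structural_embedding` and `fuse`, the single round of mean-neighbor aggregation (a stand-in for a trained graph neural network), the choice of concatenation as the fusion operator, and the toy graph are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Heterogeneous graph (assumed layout): document id -> list of (relation, neighbor id).
Graph = Dict[str, List[Tuple[str, str]]]


def structural_embedding(
    graph: Graph, features: Dict[str, Sequence[float]], doc: str
) -> List[float]:
    """Average a document's feature vector with those of its neighbors.

    One round of mean aggregation stands in for a trained GNN encoder.
    Relation types are stored but ignored here for brevity.
    """
    vectors = [features[doc]] + [features[nbr] for _, nbr in graph.get(doc, [])]
    dim = len(features[doc])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def fuse(text_emb: Sequence[float], struct_emb: Sequence[float]) -> List[float]:
    """Fuse the two embeddings by concatenation (an assumed fusion operator)."""
    return list(text_emb) + list(struct_emb)


# Toy example: document "A" cites "B" and shares an author with "C".
graph: Graph = {"A": [("citation", "B"), ("co-authorship", "C")]}
features = {"A": [3.0, 0.0], "B": [0.0, 3.0], "C": [3.0, 3.0]}

struct_a = structural_embedding(graph, features, "A")
fused_a = fuse([0.5, 0.25], struct_a)  # the text embedding values are made up
```

In practice the fused vector would typically pass through a learned projection before reaching the language model; concatenation is used here only because it keeps both signals intact and is easy to inspect.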