Large language models (LLMs) encode a large amount of world knowledge. However, because this knowledge is frozen at training time, the models are static and limited to what the training data contained at that time. To further improve the capacity of LLMs for knowledge-intensive tasks, we consider augmenting LLMs with the large-scale web through a search engine. Unlike previous augmentation sources (e.g., a Wikipedia data dump), the web provides broader, more comprehensive, and constantly updated information. In this paper, we present UNIWEB, a web-augmented LLM trained on 16 knowledge-intensive tasks in a unified text-to-text format. Rather than simply using the contents retrieved from the web, our approach makes two major improvements. First, we propose an adaptive search-engine-assisted learning method that self-evaluates the confidence of the LLM's predictions and adaptively determines when to consult the web for more data, which avoids useless or noisy augmentation from the web. Second, we design a pretraining task, continual knowledge learning, based on salient span prediction, to reduce the discrepancy between the encoded and the retrieved knowledge. Experiments on a wide range of knowledge-intensive tasks show that our model significantly outperforms previous retrieval-augmented methods.
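The adaptive retrieval idea can be illustrated with a minimal sketch. The abstract does not specify how confidence is measured, so the mean token entropy used here is an assumption, as are the threshold value and the hypothetical callables `generate` and `search`, which stand in for the model and the search engine. This is not UNIWEB's actual implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   generate(question, passages) -> (answer, per-token probability distributions)
#   search(question)             -> list of retrieved passages
Distributions = Sequence[Sequence[float]]
GenerateFn = Callable[[str, List[str]], Tuple[str, Distributions]]
SearchFn = Callable[[str], List[str]]


def mean_token_entropy(step_distributions: Distributions) -> float:
    """Average Shannon entropy (in nats) over the per-token distributions."""
    if not step_distributions:
        raise ValueError("need at least one token distribution")
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in step_distributions
    ]
    return sum(entropies) / len(entropies)


def answer_adaptively(
    question: str,
    generate: GenerateFn,
    search: SearchFn,
    entropy_threshold: float = 0.5,  # assumed value; would be tuned in practice
) -> Tuple[str, bool]:
    """Answer without retrieval first; query the web only when uncertain.

    Returns the answer and a flag telling whether the search engine was used.
    """
    answer, distributions = generate(question, [])
    if mean_token_entropy(distributions) <= entropy_threshold:
        # Low entropy: the model is treated as confident, so skip retrieval.
        return answer, False
    # High entropy: augment the input with retrieved passages and try again.
    passages = search(question)
    answer, _ = generate(question, passages)
    return answer, True
```

Skipping retrieval on confident predictions saves search calls and keeps noisy web results out of the cases where the model's own knowledge is sufficient.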