Improving Retrieval-Augmented Large Language Models via Data Importance Learning

Retrieval augmentation enables large language models to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on the multilinear extension for evaluating the data importance of retrieved data points. The multilinear extension has exponentially many terms, and one key contribution of this paper is a polynomial-time algorithm that, given a retrieval-augmented model with an additive utility function and a validation set, exactly computes the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function. We further propose an even more efficient (ε, δ)-approximation algorithm. Our experimental results illustrate that we can enhance the performance of large language models by only pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT), augmented with a search engine API, to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on the multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).
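
To make the notion of importance concrete, the sketch below evaluates the standard multilinear extension of a set function U, namely F(w) = Σ_S U(S) Π_{i∈S} w_i Π_{i∉S} (1 − w_i), by brute-force enumeration on a three-element toy corpus. It is an illustration only and not the paper's polynomial-time algorithm: the enumeration is exponential in the corpus size, and the names `multilinear_extension`, `importance`, `toy_utility`, and `CORRECT`, as well as the top-1 retrieval utility, are assumptions introduced here.

```python
from itertools import combinations

def multilinear_extension(utility, weights):
    """Brute-force F(w): expected utility when element i is kept with probability w_i."""
    n = len(weights)
    total = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            chosen = frozenset(subset)
            prob = 1.0
            for i, w in enumerate(weights):
                prob *= w if i in chosen else 1.0 - w
            total += prob * utility(chosen)
    return total

def importance(utility, weights, i):
    """Partial derivative of F in w_i; F is linear in w_i, so it equals F(w_i=1) - F(w_i=0)."""
    with_i = list(weights)
    with_i[i] = 1.0
    without_i = list(weights)
    without_i[i] = 0.0
    return (multilinear_extension(utility, with_i)
            - multilinear_extension(utility, without_i))

# Hypothetical toy corpus, ordered by similarity to a single validation query.
# Element 0 is the closest match but carries a wrong answer.
CORRECT = [False, True, True]

def toy_utility(subset):
    """Assumed utility: 1 if the best-ranked element still in the corpus is correct."""
    if not subset:
        return 0.0
    return 1.0 if CORRECT[min(subset)] else 0.0

weights = [0.5, 0.5, 0.5]
scores = [importance(toy_utility, weights, i) for i in range(len(weights))]
```

In this toy setting the noisy top-ranked element receives a negative score while the two correct elements receive positive scores, which is the kind of signal that pruning or reweighting the retrieval corpus would act on.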

