Retrieval-augmented language models show promise in addressing issues such as outdated information and hallucinations in language models (LMs). However, current research faces two main problems: 1) determining what information to retrieve, and 2) effectively combining the retrieved information during generation. We argue that valuable retrieved information should relate not only to the current source text but also to the future target text, since LMs model future tokens. Moreover, we propose that aggregation using latent variables derived from a compact latent space is more efficient than using explicit raw text, which is limited by context length and susceptible to noise. Therefore, we introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE). It encodes the text corpus into a latent space, capturing current and future information from both source and target text. Additionally, we leverage the VAE to initialize the latent space and adopt the probabilistic form of the retrieval generation paradigm by expanding the Gaussian prior distribution into a Gaussian mixture distribution. Theoretical analysis provides an optimizable upper bound for RegaVAE. Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
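To make the idea of expanding a Gaussian prior into a Gaussian mixture more concrete, the following is a minimal sketch and not the RegaVAE implementation. It assumes that stored latent vectors are retrieved by cosine similarity, that mixture weights come from a softmax over those similarities, and that every component shares an isotropic variance; none of these choices is stated in the abstract. All function names (`cosine`, `retrieve_top_k`, `mixture_prior`, `sample_prior`) and the toy `store` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_top_k(query, store, k):
    """Return (similarity, index) pairs for the k stored latents closest to query."""
    scored = sorted(((cosine(query, z), i) for i, z in enumerate(store)), reverse=True)
    return scored[:k]


def mixture_prior(query, store, k, temperature=1.0):
    """Build mixture weights and component means from retrieved neighbours.

    Assumption: weights are a softmax over similarities. The source does not
    say how the mixture weights are chosen.
    """
    hits = retrieve_top_k(query, store, k)
    top = max(s for s, _ in hits)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s, _ in hits]
    total = sum(exps)
    weights = [e / total for e in exps]
    means = [store[i] for _, i in hits]
    return weights, means


def sample_prior(weights, means, sigma=1.0, rng=random):
    """Draw one latent: pick a component, then add isotropic Gaussian noise."""
    mean = rng.choices(means, weights=weights, k=1)[0]
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


# Toy latent store standing in for an encoded corpus (hypothetical values).
store = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
weights, means = mixture_prior([1.0, 0.05], store, k=2)
z = sample_prior(weights, means, sigma=0.1, rng=random.Random(0))
```

With `k=1` the mixture collapses to a single Gaussian centred on the nearest neighbour, which is one way to see the mixture as a generalization of a single Gaussian prior under the assumptions above.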