Retrieval-based language models (RLMs) have recently received much attention. However, most of them rely on a pre-trained retriever with frozen parameters, which may not adapt well to causal language models. In this work, we propose Grouped Cross-Attention, a novel module that enables joint pre-training of the retriever and the causal LM, and apply it to long-context modeling. Given an input sequence, we split it into chunks and use the current chunk to retrieve past chunks for subsequent text generation. This design allows the retriever to learn, in an end-to-end manner, how to retrieve the past chunks that best minimize the auto-regressive loss on subsequent tokens. By integrating top-$k$ retrieval, our model can be pre-trained efficiently from scratch with context lengths of up to 64K tokens. Experiments show that, compared with long-range LM baselines, our model achieves lower perplexity at comparable or lower pre-training and inference cost.
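The mechanism described above can be sketched in a few lines: split the token sequence into fixed-size chunks, score past chunks against the current chunk, keep the top-$k$, and cross-attend over them with the retrieval scores acting as differentiable gates (so gradients from the auto-regressive loss can flow back to the retriever). This is a minimal illustrative sketch, not the paper's implementation; all function names, the mean-pooling retriever, and the dimensions are assumptions.

```python
# Hypothetical sketch of chunked top-k retrieval + gated cross-attention.
# All names, pooling choices, and dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_into_chunks(tokens, chunk_size):
    # (T, d) token embeddings -> (T // chunk_size, chunk_size, d)
    T, d = tokens.shape
    n = T // chunk_size
    return tokens[: n * chunk_size].reshape(n, chunk_size, d)

def retrieve_top_k(chunks, cur_idx, k):
    # Score strictly-past chunks by dot product of mean-pooled summaries.
    summaries = chunks.mean(axis=1)          # (n_chunks, d)
    query = summaries[cur_idx]               # (d,)
    scores = summaries[:cur_idx] @ query     # only past chunks are eligible
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]

def gated_cross_attention(query_chunk, retrieved, scores):
    # Cross-attend the current chunk's tokens over each retrieved chunk
    # separately, then combine the k outputs weighted by the softmaxed
    # retrieval scores, so the retriever receives gradient end-to-end.
    d = query_chunk.shape[-1]
    gate = softmax(scores)                   # (k,)
    out = np.zeros_like(query_chunk)
    for g, kv in zip(gate, retrieved):       # kv: (chunk_size, d)
        attn = softmax(query_chunk @ kv.T / np.sqrt(d))  # (cs, cs)
        out += g * (attn @ kv)
    return out

# Toy usage: 8 chunks of 16 tokens, 32-dim embeddings, top-2 retrieval.
chunk_size, d, k = 16, 32, 2
tokens = rng.standard_normal((8 * chunk_size, d))
chunks = split_into_chunks(tokens, chunk_size)
top, top_scores = retrieve_top_k(chunks, cur_idx=7, k=k)
ctx = gated_cross_attention(chunks[7], chunks[top], top_scores)
print(ctx.shape)  # (16, 32)
```

In a trained model the gating makes the whole path differentiable: because the combined output depends on the retrieval scores, the auto-regressive loss on subsequent tokens provides a learning signal for the retriever itself, which is the property a frozen pre-trained retriever lacks.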