Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study

Large decoder-only language models (LMs) can be substantially improved in terms of perplexity by retrieval (e.g., RETRO), but the impact of retrieval on text generation quality and downstream task accuracy is unclear. Thus, it remains an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study of a scalable pretrained retrieval-augmented LM (i.e., RETRO), comparing it with standard GPT and with GPT augmented with retrieval at the fine-tuning or inference stage. We first provide a recipe to reproduce RETRO up to 9.5B parameters while retrieving from a text corpus of 330B tokens. Based on that, we report the following novel findings: i) RETRO outperforms GPT on text generation, with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity when using a nontoxic retrieval database. ii) On the LM Evaluation Harness benchmark, RETRO substantially outperforms GPT on knowledge-intensive tasks but is on par with GPT on other tasks. Furthermore, we introduce a simple variant of the model, RETRO++, which substantially improves the open-domain QA results of the original RETRO (e.g., an exact match (EM) score gain of 8.6 on Natural Questions) and significantly outperforms retrieval-augmented GPT in both fine-tuning and zero-shot evaluation settings. Our findings highlight pretraining autoregressive LMs with retrieval as a promising direction for future foundation models. We release our code and model at: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md
