Retrieval (e.g., RETRO) can substantially improve the perplexity of large decoder-only language models (LMs), but its impact on text generation quality and downstream task accuracy is unclear. Thus, an open question remains: should we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study of a scalable retrieval-augmented LM pretrained with retrieval (i.e., RETRO), comparing it with standard GPT and with retrieval-augmented GPT in which retrieval is incorporated at the fine-tuning or inference stage. We first provide the recipe to reproduce RETRO up to 9.5B parameters while retrieving from a text corpus of 330B tokens. Based on that, we report the following novel findings: i) RETRO outperforms GPT on text generation, with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity when using a nontoxic retrieval database. ii) On the LM Evaluation Harness benchmark, RETRO substantially outperforms GPT on knowledge-intensive tasks and is on par with GPT on other tasks. Furthermore, we introduce a simple variant of the model, RETRO++, which substantially improves the open-domain QA results of the original RETRO (e.g., EM score +8.6 on Natural Questions) and significantly outperforms retrieval-augmented GPT in both fine-tuning and zero-shot evaluation settings. Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models. We release our implementation at: https://github.com/NVIDIA/Megatron-LM#retro.