Pretraining auto-regressive large language models (LLMs) with retrieval yields better perplexity and factual accuracy by leveraging external databases. However, existing pretrained retrieval-augmented LLMs are still limited in size (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval. Specifically, we continue to pretrain a 43B GPT model on an additional 100 billion tokens using the Retro augmentation method, retrieving from 1.2 trillion tokens. Notably, the resulting foundation model, Retro 48B, substantially outperforms its counterpart GPT 43B trained on 1.2T tokens in terms of perplexity, with only 2.58% additional GPU hours, demonstrating the significant scaling potential of the method. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction-tuned GPT on a wide range of zero-shot tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA and reading comprehension tasks, 10% over GPT across 4 challenging long-form QA tasks, and 16% over GPT across 3 summarization tasks. Surprisingly, we find that one can ablate the encoder from the InstructRetro architecture and directly use its decoder backbone while achieving comparable results. Our results highlight a promising direction for obtaining a better GPT decoder through continued pretraining with retrieval before instruction tuning. Our code and checkpoints are publicly available at: https://github.com/NVIDIA/Megatron-LM/tree/InstructRetro/tools/retro.