Pretraining auto-regressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLMs is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval before instruction tuning. Specifically, we continue to pretrain the 43B GPT model on an additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. The obtained foundation model, Retro 48B, substantially outperforms the original 43B GPT in terms of perplexity. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction-tuned GPT on zero-shot question answering (QA) tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA tasks, and 10% over GPT across 4 challenging long-form QA tasks. Surprisingly, we find that one can ablate the encoder from the InstructRetro architecture and directly use its decoder backbone, while achieving comparable results. We hypothesize that pretraining with retrieval makes its decoder good at incorporating context for QA. Our results highlight a promising direction for obtaining a better GPT decoder for QA through continued pretraining with retrieval before instruction tuning.