Extending the context window of large language models (LLMs) has recently become popular, while augmenting LLMs with retrieval has existed for years. This raises two natural questions: i) Which is better for downstream tasks, retrieval augmentation or a long context window? ii) Can the two methods be combined to get the best of both worlds? In this work, we answer these questions by studying both solutions with two state-of-the-art pretrained LLMs: a proprietary 43B GPT and Llama2-70B. Perhaps surprisingly, we find that an LLM with a 4K context window using simple retrieval augmentation at generation can achieve performance comparable to that of an LLM finetuned to a 16K context window via positional interpolation on long context tasks, while requiring much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes. Our best model, retrieval-augmented Llama2-70B with a 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in average score on nine long context tasks, including question answering, query-based summarization, and in-context few-shot learning. It also outperforms its non-retrieval Llama2-70B-32k baseline by a margin, while being much faster at generation. Our study provides practitioners with general insights into the choice between retrieval augmentation and long context extension of LLMs.