Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. Compared with fine-tuning a fully parametric model, RaLM offers low-cost adaptation to the latest data and better source attribution. Among RaLM approaches, iterative RaLM delivers better generation quality because the retriever and the language model interact more frequently. Despite these benefits, iterative RaLM usually incurs high overhead from the frequent retrieval steps. To this end, we propose RaLMSpec, a speculation-inspired framework that provides generic speed-up over iterative RaLM while preserving the same model outputs through speculative retrieval and batched verification. By further incorporating prefetching, an optimal speculation stride scheduler, and asynchronous verification, RaLMSpec can automatically exploit the acceleration potential to the fullest. For naive iterative RaLM serving, extensive evaluations over three language models on four downstream QA datasets demonstrate that RaLMSpec achieves speed-up ratios of 1.75-2.39x, 1.04-1.39x, and 1.31-1.77x over the baseline when the retriever is an exact dense retriever, an approximate dense retriever, and a sparse retriever, respectively. For KNN-LM serving, RaLMSpec achieves speed-up ratios of up to 7.59x and 2.45x over the baseline when the retriever is an exact dense retriever and an approximate dense retriever, respectively.
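The idea of speculative retrieval with batched verification can be illustrated with a toy sketch. This is not the RaLMSpec implementation: the corpus `KB`, the functions `true_retrieve_batch`, `lm_step`, and `make_query`, and the speculation rule (reuse the most recently verified document) are all hypothetical stand-ins chosen for illustration. The sketch only shows why rolling back to the first mis-speculated step keeps the output identical to the non-speculative baseline.

```python
from typing import List, Sequence, Tuple

KB = ["alpha", "beta", "gamma", "delta"]  # toy corpus (assumption)


def true_retrieve_batch(queries: Sequence[int]) -> List[str]:
    """Stand-in for the expensive retriever; answers a batch of queries."""
    return [KB[q % len(KB)] for q in queries]


def lm_step(context: List[str], doc: str) -> str:
    """Stand-in for one generation step conditioned on a retrieved doc."""
    return f"{doc}{len(context)}"


def make_query(context: List[str]) -> int:
    """Toy query derived from the text generated so far."""
    return len("".join(context)) // 12


def baseline_serve(n_steps: int) -> List[str]:
    """Iterative serving: one call to the true retriever per step."""
    context: List[str] = []
    for _ in range(n_steps):
        doc = true_retrieve_batch([make_query(context)])[0]
        context.append(lm_step(context, doc))
    return context


def speculative_serve(n_steps: int, stride: int = 3) -> Tuple[List[str], int]:
    """Speculate for `stride` steps, then verify all of them in one batch."""
    context: List[str] = []
    cached_doc = KB[0]  # speculation source (assumption): last verified doc
    retriever_calls = 0
    while len(context) < n_steps:
        start = len(context)
        queries: List[int] = []
        guesses: List[str] = []
        for _ in range(min(stride, n_steps - start)):
            queries.append(make_query(context))
            guesses.append(cached_doc)
            context.append(lm_step(context, cached_doc))
        truths = true_retrieve_batch(queries)  # one batched verification
        retriever_calls += 1
        for i, (guess, truth) in enumerate(zip(guesses, truths)):
            if guess != truth:
                # Steps before i were verified, so query i was built from a
                # correct context and truths[i] is the right document.
                del context[start + i:]  # roll back the mis-speculated steps
                context.append(lm_step(context, truth))
                cached_doc = truth
                break
    return context, retriever_calls
```

In this toy setting, `speculative_serve(12, stride=3)[0]` equals `baseline_serve(12)` by construction, and the second return value counts the batched calls to the true retriever, which never exceeds the number of steps. Choosing the stride trades wasted speculation against verification frequency, which is the quantity the stride scheduler mentioned in the abstract is meant to tune.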