As Large Language Models (LLMs) have made significant advances across tasks such as question answering, translation, text summarization, and dialogue systems, the accuracy of the information they produce has become crucial, especially for serious financial products serving billions of users like Alipay. To address this, Alipay has developed a Retrieval-Augmented Generation (RAG) system that grounds LLMs on the most accurate and up-to-date information. However, for a real-world product serving millions of users, the inference speed of LLMs is a far more critical factor than it is for a purely experimental model. This paper therefore presents a generic framework for accelerating inference, which yields a substantial speedup and cost reduction for our RAG system while keeping generation accuracy lossless. In the traditional inference process, the LLM generates tokens one at a time, so the time consumed is proportional to the number of generated tokens. To improve on this, our framework, named \textit{lookahead}, introduces a \textit{multi-branch} strategy. Instead of generating a single token at a time, we propose a \textit{Trie-based Retrieval} (TR) process that enables the generation of multiple branches simultaneously, each of which is a sequence of tokens. Subsequently, for each branch, a \textit{Verification and Accept} (VA) process is performed to identify the longest correct sub-sequence as the final output. Our strategy offers two distinct advantages: (1) it guarantees absolute correctness of the output, avoiding any approximation algorithms, and (2) the worst-case performance of our approach is equivalent to that of the conventional process. We conduct extensive experiments to demonstrate the significant improvements achieved by applying our inference acceleration framework. Code is available at https://github.com/alipay/PainlessInferenceAcceleration.
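The Trie-based Retrieval and Verification and Accept steps described in the abstract can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not taken from the PainlessInferenceAcceleration repository: the names `Trie`, `verify_and_accept`, `lookahead_decode`, and `greedy_decode` are hypothetical, the "model" is a deterministic stand-in function `toy_next_token`, and each branch is checked token by token, whereas the abstract describes the branches as being handled simultaneously.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for one greedy LLM decoding step: context -> next token id.
NextToken = Callable[[Sequence[int]], int]


class Trie:
    """Token trie; each root-to-node path is a previously seen token sequence."""

    def __init__(self) -> None:
        self.children: Dict[int, "Trie"] = {}

    def insert(self, tokens: Sequence[int]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, Trie())

    def retrieve(self, prefix: Sequence[int], max_len: int) -> List[List[int]]:
        """Return the branches (candidate continuations) stored after `prefix`."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        branches: List[List[int]] = []

        def walk(n: "Trie", path: List[int]) -> None:
            # A branch ends at a leaf or once it reaches max_len tokens.
            if path and (not n.children or len(path) == max_len):
                branches.append(path)
                return
            for tok, child in n.children.items():
                walk(child, path + [tok])

        walk(node, [])
        return branches


def verify_and_accept(context: Sequence[int], branches: List[List[int]],
                      next_token: NextToken) -> List[int]:
    """Keep the longest branch prefix that matches what the model itself emits."""
    best = [next_token(context)]  # worst case: one token, as in plain decoding
    for branch in branches:
        ctx, accepted = list(context), []
        for tok in branch:
            if next_token(ctx) != tok:
                break  # first mismatch ends the accepted prefix
            accepted.append(tok)
            ctx.append(tok)
        accepted.append(next_token(ctx))  # the model's own token after the match
        if len(accepted) > len(best):
            best = accepted
    return best


def lookahead_decode(prompt: Sequence[int], next_token: NextToken, trie: Trie,
                     max_new: int, branch_len: int = 4) -> Tuple[List[int], int]:
    """Return (new tokens, number of decoding steps) using trie drafts."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        branches = trie.retrieve(out[-1:], branch_len)  # key on the last token
        out.extend(verify_and_accept(out, branches, next_token))
        steps += 1
    return out[len(prompt):][:max_new], steps


def greedy_decode(prompt: Sequence[int], next_token: NextToken,
                  max_new: int) -> Tuple[List[int], int]:
    """Conventional baseline: exactly one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(next_token(out))
    return out[len(prompt):], max_new


def toy_next_token(ctx: Sequence[int]) -> int:
    # Toy deterministic "model": counts upward modulo 5.
    return (ctx[-1] + 1) % 5


trie = Trie()
trie.insert([0, 1, 2, 3])  # a draft that matches the toy model
trie.insert([0, 1, 9])     # a draft that diverges after one token
fast, fast_steps = lookahead_decode([0], toy_next_token, trie, max_new=10)
slow, slow_steps = greedy_decode([0], toy_next_token, max_new=10)
print(fast == slow, fast_steps, slow_steps)
```

In this toy run the two decoders produce the same tokens, which mirrors the lossless claim, and passing an empty `Trie()` degrades to one token per step, which mirrors the stated worst case. The step counts are only meaningful for this stand-in model and say nothing about real-world speedups.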