Large Language Models (LLMs) have made significant advances across tasks such as question answering, translation, text summarization, and dialogue systems, and the accuracy of the information they produce has become crucial, especially for serious financial products such as Alipay that serve billions of users. To address this, Alipay has developed a Retrieval-Augmented Generation (RAG) system that grounds LLMs in the most accurate and up-to-date information. For a real-world product operating at this scale, however, inference speed is a far more critical factor than it is for an experimental model. This paper therefore presents a generic framework for accelerating inference that substantially increases the speed and reduces the cost of our RAG system while keeping generation accuracy lossless. In conventional inference, the LLM generates tokens one at a time, so the time consumed is proportional to the number of generated tokens. Our framework, named \textit{lookahead}, improves on this with a \textit{multi-branch} strategy. Instead of generating a single token per step, we propose a \textit{Trie-based Retrieval} (TR) process that produces multiple branches simultaneously, each of which is a sequence of tokens. For each branch, a \textit{Verification and Accept} (VA) process then identifies the longest correct sub-sequence as the final output. The strategy offers two distinct advantages: (1) it guarantees the correctness of the output without resorting to any approximation algorithm, and (2) its worst-case performance is equivalent to that of the conventional process. We conduct extensive experiments to demonstrate the significant improvements achieved by applying our inference acceleration framework.
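The interplay between the TR and VA processes can be illustrated with a minimal Python sketch. It is not the implementation used in the paper. The names `Trie`, `verify_and_accept`, `lookahead_step`, and `toy_next_token`, the parameters `prefix_len` and `max_len`, and the integer token ids are all assumptions introduced for illustration. The sketch also verifies branches by calling a greedy next-token function once per drafted token, whereas a real system would presumably score all branches in a single forward pass. Appending one ordinarily decoded token after the accepted ones is likewise an assumption, made here so that every step yields at least one token.

```python
from typing import Callable, Dict, List, Sequence


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


class Trie:
    """Stores token sequences so that draft branches can be looked up by prefix."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def retrieve(self, prefix: Sequence[int], max_len: int) -> List[List[int]]:
        """Return every stored continuation of `prefix`, cut off at max_len tokens."""
        node = self.root
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        branches: List[List[int]] = []

        def walk(n: TrieNode, path: List[int]) -> None:
            # A branch ends at a leaf or once it reaches max_len tokens.
            if path and (not n.children or len(path) == max_len):
                branches.append(path)
                return
            for tok, child in n.children.items():
                walk(child, path + [tok])

        walk(node, [])
        return branches


def verify_and_accept(
    context: Sequence[int],
    branches: List[List[int]],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Keep the longest branch prefix that matches what the model would generate."""
    best: List[int] = []
    for branch in branches:
        accepted: List[int] = []
        for tok in branch:
            if next_token(list(context) + accepted) != tok:
                break  # first mismatch: the rest of this branch is discarded
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best


def lookahead_step(
    context: Sequence[int],
    trie: Trie,
    next_token: Callable[[List[int]], int],
    prefix_len: int = 1,
    max_len: int = 4,
) -> List[int]:
    """One decoding step: retrieve drafts, verify them, then decode one token."""
    branches = trie.retrieve(list(context)[-prefix_len:], max_len)
    accepted = verify_and_accept(context, branches, next_token)
    # Assumption: one ordinary decoding step follows, so progress is >= 1 token.
    accepted.append(next_token(list(context) + accepted))
    return accepted


# Toy stand-in for a greedy LLM: it always continues a fixed target sequence.
target = [1, 2, 3, 4, 5, 6]


def toy_next_token(ctx: List[int]) -> int:
    return target[len(ctx)] if len(ctx) < len(target) else 0


trie = Trie()
trie.insert([1, 2, 3, 9])  # agrees with the model for two tokens after the 1
trie.insert([1, 2, 7])     # agrees with the model for one token after the 1
print(lookahead_step([1], trie, toy_next_token))  # [2, 3, 4]
```

In this toy run a single step emits three tokens instead of one, and every emitted token is one the greedy model would have produced anyway. When no branch matches, or the trie is empty, the step falls back to a single ordinarily decoded token, which mirrors the worst-case claim above.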