As Large Language Models (LLMs) make significant advances across tasks such as question answering, translation, text summarization, and dialogue systems, the accuracy of generated information becomes crucial, especially for serious financial products serving billions of users, such as Alipay. Moreover, for a real-world product serving millions of users, the inference speed of LLMs is a critical factor, unlike for a purely experimental model. Hence, this paper presents a generic framework for accelerating the inference process, yielding a substantial speedup and cost reduction in our LLM-based scenarios with lossless generation accuracy. In the traditional inference process, the LLM generates each token sequentially, so the time consumed is proportional to the number of generated tokens. To improve this process, our framework, named \textit{lookahead}, introduces a \textit{multi-branch} strategy: instead of generating a single token at a time, we propose a Trie-based retrieval and verification mechanism that accepts several tokens per forward step. Our strategy offers two distinct advantages: (1) it guarantees the output is exactly correct, avoiding any approximation algorithms, and (2) its worst-case performance is equivalent to that of the conventional process. We conduct extensive experiments demonstrating the significant improvements achieved by applying our inference acceleration framework. Our framework has been widely deployed in Alipay since April 2023 and achieves a remarkable 2.66x to 6.26x speedup. Our code is available at https://github.com/alipay/PainlessInferenceAcceleration.
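To illustrate the idea behind the Trie-based retrieval and verification described above, here is a minimal Python sketch. All names (`Trie`, `lookahead_decode`, `model_next`, `ctx`, `draft_len`) are illustrative, not the repository's actual API; and where a real deployment would score all draft tokens in a single batched forward pass, this sketch calls the model once per position purely for clarity. The lossless-output guarantee comes from the verification step: a draft token is accepted only if it equals what the model itself would have generated.

```python
class Trie:
    """A trie over previously observed token sequences."""

    def __init__(self):
        self.children = {}  # token -> child subtree (nested dicts)

    def insert(self, tokens):
        node = self.children
        for t in tokens:
            node = node.setdefault(t, {})

    def retrieve(self, prefix, max_len):
        """Walk down `prefix`, then return one branch of up to `max_len` draft tokens."""
        node = self.children
        for t in prefix:
            if t not in node:
                return []  # no stored continuation for this context
            node = node[t]
        draft = []
        while node and len(draft) < max_len:
            t = next(iter(node))  # sketch: follow an arbitrary branch
            draft.append(t)
            node = node[t]
        return draft


def lookahead_decode(model_next, prompt, trie, steps, draft_len=4, ctx=2):
    """Draft-and-verify loop; `model_next(seq) -> next token` stands in for the LLM.

    Worst case (no draft ever accepted) degenerates to ordinary one-token-per-step
    decoding, matching the paper's worst-case equivalence claim.
    """
    seq = list(prompt)
    generated = 0
    while generated < steps:
        # Retrieve a multi-token draft keyed on the last `ctx` tokens.
        draft = trie.retrieve(seq[-ctx:], draft_len)
        # Verify: accept draft tokens only while they match the model's own output.
        cur, accepted = list(seq), 0
        for d in draft:
            if generated + accepted < steps and model_next(cur) == d:
                cur.append(d)
                accepted += 1
            else:
                break
        if accepted == 0:
            # Fall back to conventional single-token generation.
            cur.append(model_next(cur))
            accepted = 1
        seq = cur
        generated += accepted
    return seq
```

For example, with a toy deterministic model `lambda s: s[-1] + 1` and the sequence `[3, 4, 5, 6]` stored in the trie, decoding from prompt `[1, 2, 3]` accepts the three drafted tokens `4, 5, 6` in one verification pass before falling back to token-by-token generation.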