The massive adoption of large language models (LLMs) demands efficient deployment strategies. However, the auto-regressive decoding process, which is fundamental to how most LLMs generate text, poses challenges to achieve efficient serving. In this work, we introduce a parallel auto-regressive generation method. By instruct-tuning on general domain data that contains hierarchical structures, we enable LLMs to independently plan their generation process and perform auto-parallel auto-regressive (APAR) generation, significantly reducing the number of generation steps. APAR alone can achieve up to 2x speed-up, and when combined with speculative decoding, the speed-up can reach up to 4x. In addition, APAR reduces the key-value cache consumption and attention computation during generation. This leads to a throughput increase of 20-70% and a latency reduce of 20-35% in high-throughput scenarios, compared to state-of-the-art serving frameworks.
翻译:大语言模型(LLM)的广泛应用需要高效的部署策略。然而,作为多数LLM文本生成基础的自回归解码过程,对实现高效服务提出了挑战。本文提出一种并行自回归生成方法。通过对包含层次结构的通用领域数据进行指令微调,我们使LLM能够自主规划生成过程并执行自动并行自回归(APAR)生成,从而显著减少生成步骤数。仅使用APAR即可实现最高2倍加速,当与推测解码结合时,加速比可达4倍。此外,APAR减少了生成过程中的键值缓存消耗和注意力计算量。在高吞吐场景下,与最先进的服务框架相比,这使得吞吐量提升20-70%,延迟降低20-35%。