The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models of different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced.
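The interplay of the two policies can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the released implementation: the abstract does not say how either policy makes its decision, so the sketch assumes that fallback triggers when the small model's top probability drops below a threshold, and that rollback triggers when the large model assigns low probability to a token the small model already emitted. The names `bild_generate`, `fallback_threshold`, `rollback_threshold`, `toy_small`, and `toy_large` are hypothetical, and end-of-sequence handling is omitted.

```python
from typing import Callable, List, Sequence, Tuple

# A "model" is any callable mapping a token prefix to a probability
# distribution over the vocabulary (illustrative stand-in for a Transformer).
Model = Callable[[Sequence[int]], List[float]]


def argmax(probs: List[float]) -> int:
    return max(range(len(probs)), key=probs.__getitem__)


def bild_generate(
    small: Model,
    large: Model,
    prompt: Sequence[int],
    max_new_tokens: int,
    fallback_threshold: float = 0.6,  # assumed confidence criterion
    rollback_threshold: float = 0.5,  # assumed disagreement criterion
) -> Tuple[List[int], int]:
    """Return the generated tokens and the number of large-model invocations."""
    tokens = list(prompt)
    verified = len(tokens)  # tokens before this index were checked by the large model
    large_calls = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        probs = small(tokens)
        tok = argmax(probs)
        if probs[tok] >= fallback_threshold:
            tokens.append(tok)  # small model is confident: keep decoding cheaply
            continue
        # Fallback: the small model is unsure, so control passes to the large model.
        large_calls += 1
        # Rollback check over all unverified small-model tokens. A real
        # Transformer would score these positions in one parallel forward
        # pass; the loop below only emulates that.
        for pos in range(verified, len(tokens)):
            if 1.0 - large(tokens[:pos])[tokens[pos]] > rollback_threshold:
                tokens = tokens[:pos]  # discard this token and everything after it
                break
        # The large model supplies the next token at the (possibly rolled-back) end.
        tokens.append(argmax(large(tokens)))
        verified = len(tokens)
    return tokens, large_calls


def peaked(token: int, confidence: float, vocab_size: int = 4) -> List[float]:
    """Distribution putting `confidence` on `token` and spreading the rest evenly."""
    rest = (1.0 - confidence) / (vocab_size - 1)
    return [confidence if i == token else rest for i in range(vocab_size)]


def toy_large(prefix: Sequence[int]) -> List[float]:
    # Toy "ground truth": the next token is always (last + 1) mod 4.
    return peaked((prefix[-1] + 1) % 4, 0.9)


def toy_small(prefix: Sequence[int]) -> List[float]:
    if len(prefix) == 3:  # confidently wrong: repeats the last token
        return peaked(prefix[-1], 0.9)
    if len(prefix) == 5:  # right guess, but too unsure to pass the threshold
        return peaked((prefix[-1] + 1) % 4, 0.4)
    return peaked((prefix[-1] + 1) % 4, 0.9)


if __name__ == "__main__":
    print(bild_generate(toy_small, toy_large, prompt=[0], max_new_tokens=6))
```

In this toy run the small model makes a confident mistake that goes unnoticed until its next low-confidence step, at which point the large model is invoked, rolls the sequence back to the mistake, and supplies the corrected token. The large model is consulted twice for six generated tokens, which is the source of the latency savings the abstract describes.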