Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted by their significant inference latency. The problem is particularly pronounced because generative LLM inference is autoregressive: tokens are generated sequentially, since each token depends on all previous output tokens. It is therefore challenging to achieve any token-level parallelism, which makes inference extremely memory-bound. In this work, we propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token, using values predicted from early-layer hidden states. For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized, which allows us to accelerate generative LLM inference. We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy, and we show how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
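The amortization argument in the abstract can be illustrated with a toy cost model. This is a minimal sketch and not the SPEED implementation. It assumes that every decoder layer shares one set of weights, that a memory-bound layer step costs one weight load no matter how many in-flight tokens it serves, and that every speculated token is accepted (mispredictions and compute cost are ignored). The function names and the `predict_after` parameter are hypothetical, introduced only for illustration.

```python
def sequential_weight_loads(num_tokens: int, num_layers: int) -> int:
    """Plain autoregressive decoding: each token runs every layer alone,
    so each layer step loads the weights for a single token."""
    return num_tokens * num_layers


def pipelined_weight_loads(num_tokens: int, num_layers: int, predict_after: int) -> int:
    """Speculative pipelining under the assumptions stated above.

    The next token is launched as soon as the newest in-flight token has
    completed `predict_after` layers (standing in for a prediction made from
    an early-layer hidden state). Because all layers are assumed to share
    weights, one weight load advances every in-flight token by one layer.
    """
    if not 1 <= predict_after <= num_layers:
        raise ValueError("predict_after must be between 1 and num_layers")

    depths = []    # layers completed so far by each in-flight token
    started = 0    # tokens launched
    finished = 0   # tokens that have passed through all layers
    loads = 0      # batched layer steps, i.e. weight loads

    while finished < num_tokens:
        # Launch the next token once the newest one is deep enough to predict from.
        if started < num_tokens and (not depths or depths[-1] >= predict_after):
            depths.append(0)
            started += 1

        # One batched step: a single load of the shared weights serves all tokens.
        depths = [d + 1 for d in depths]
        loads += 1

        finished += sum(1 for d in depths if d == num_layers)
        depths = [d for d in depths if d < num_layers]

    return loads


if __name__ == "__main__":
    tokens, layers, early = 4, 12, 4
    print("sequential loads:", sequential_weight_loads(tokens, layers))
    print("pipelined loads: ", pipelined_weight_loads(tokens, layers, early))
```

In this idealized setting, generating 4 tokens through 12 shared-weight layers with a prediction after layer 4 takes 24 weight loads instead of 48. Setting `predict_after` equal to `num_layers` removes all overlap and recovers the sequential count. Without parameter sharing, tokens at different depths would need different weights, so the single-load-per-step assumption would no longer hold.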