Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency during generation due to the sequential processing nature of autoregressive LLMs. While linear attention and speculative decoding offer potential solutions, their applicability to, and synergistic potential for, enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67$\times$ reduction in perplexity on the LLaMA model and up to a 2$\times$ speedup during generation compared to prior linear attention methods. Code and models are available at https://github.com/GATECH-EIC/Linearized-LLM.
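As a minimal illustration of the quadratic bottleneck the abstract refers to (a generic kernelized-attention sketch, not the paper's specific augmentation technique): standard softmax attention materializes an $n \times n$ score matrix, whereas linear attention applies a feature map $\phi$ to queries and keys and exploits associativity, computing $\phi(K)^\top V$ first so the cost grows linearly in sequence length $n$. The feature map below (shifted ReLU) is one common hypothetical choice; actual linearization methods differ.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the (n x n) score matrix costs O(n^2) time and memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized attention: associativity lets us form phi(K)^T V first,
    # a (d x d_v) matrix, so the cost is O(n * d * d_v) -- linear in n.
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                      # (d, d_v), independent of n
    Z = Qp @ Kp.sum(axis=0)            # per-row normalizer, shape (n,)
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_soft.shape, out_lin.shape)   # both (8, 4)
```

The two outputs differ numerically (the kernel only approximates softmax), but both produce one $d$-dimensional vector per token; the point is the asymptotic cost, which is why linear attention is attractive for long-sequence autoregressive generation.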