Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), and it is most effective when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across models. We present Budget EAGLE (Beagle), to our knowledge the first SD model built on a cross-attention-based Transformer decoder that performs on par with a leading self-attention SD model (EAGLE-v2). Beagle eliminates the need for pooling or auxiliary components, which simplifies the architecture, improves training efficiency, and keeps memory usage stable during training-time simulation. To train this architecture effectively, we propose Two-Stage Block-Attention Training, a new method that provides training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong architectural alternative for speculative decoding.
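To illustrate why draft-target alignment matters for speculative decoding, the following is a minimal sketch of the generic draft-then-verify loop under greedy decoding. It is not the Beagle or EAGLE-v2 architecture: the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, the "models" are toy functions over integer tokens, and exact-match greedy acceptance is assumed in place of the sampling-based acceptance rule used in practice.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly committed tokens."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposed token. A real system scores all
    #    k positions in one batched forward pass; this loop is for clarity only.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[int], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]


def toy_target(ctx: List[int]) -> int:
    # Toy target: count upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Toy draft: agrees with the target except after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

In this sketch the output always equals what the target alone would produce under greedy decoding; the draft only changes how many tokens each target verification commits. The better the draft agrees with the target, the more tokens are accepted per step, which is the sense in which alignment drives the speedup.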