The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, because tokens are generated sequentially. Speculative and parallel decoding techniques attempt to mitigate this, but each has a limitation: they either rely on less accurate smaller models for generation or fail to fully leverage the base LLM's representations. We introduce a novel architecture, Tandem transformers, to address these issues. This architecture combines (1) a small autoregressive model and (2) a large model operating in block mode (processing multiple tokens simultaneously). The small model's predictive accuracy is substantially enhanced by letting it attend to the large model's richer representations. On the PaLM2 pretraining dataset, a Tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, and offers a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. We further incorporate the Tandem model within the speculative decoding (SPEED) framework, where the large model validates tokens from the small model. This ensures that the Tandem of PaLM2-Bison and PaLM2-Gecko achieves substantial speedup (around 1.14x faster than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream task accuracy.
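The draft-then-verify loop described above, in which a small model proposes a block of tokens and the large model validates them, can be illustrated with a minimal sketch. This is not the Tandem architecture itself: it omits the small model's attention to the large model's representations, assumes greedy decoding rather than sampling, and replaces both networks with hypothetical toy functions (`draft_next`, `target_next`). The name `speculative_decode` and the `block_size` parameter are likewise assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(next_fn: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Plain autoregressive greedy decoding: one token per model call."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_fn(tokens))
    return tokens[len(prompt):]


def speculative_decode(
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    prompt: List[Token],
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify decoding.

    Returns the generated tokens and the number of draft tokens accepted.
    """
    tokens = list(prompt)
    produced = 0
    accepted = 0
    while produced < max_new_tokens:
        k = min(block_size, max_new_tokens - produced)

        # 1. The small model drafts k tokens autoregressively.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The large model checks every drafted position. A real system
        #    would score the whole block in one forward pass; this toy
        #    version calls the target once per position instead.
        n_ok = 0
        correction = None
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                correction = expected
                break
            n_ok += 1

        # 3. Keep the agreeing prefix; on a mismatch, substitute the
        #    large model's token and discard the rest of the draft.
        tokens.extend(draft[:n_ok])
        produced += n_ok
        accepted += n_ok
        if correction is not None:
            tokens.append(correction)
            produced += 1
    return tokens[len(prompt):], accepted


# Hypothetical toy "models": the draft disagrees with the target
# whenever the context length is a multiple of 4.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    t = (ctx[-1] * 3 + 1) % 10
    return t if len(ctx) % 4 else (t + 1) % 10
```

Because every emitted token is either confirmed or supplied by the large model, the output of this greedy variant matches what the large model would produce on its own, which mirrors the abstract's claim of identical downstream accuracy. Any speedup depends on how often the draft is accepted, which is why a more accurate small model helps.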