Tandem Transformers for Inference Efficient LLMs

The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face limitations: either relying on less accurate smaller models for generation or failing to fully leverage the base LLM's representations. We introduce a novel architecture, Tandem transformers, to address these issues. This architecture uniquely combines (1) a small autoregressive model and (2) a large model operating in block mode (processing multiple tokens simultaneously). The small model's predictive accuracy is substantially enhanced by granting it attention to the large model's richer representations. On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. We further incorporate the tandem model within the speculative decoding (SPEED) framework where the large model validates tokens from the small model. This ensures that the Tandem of PaLM2-Bison and PaLM2-Gecko achieves substantial speedup (around 1.14x faster than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream task accuracy.
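The draft-then-verify loop mentioned above, in which the large model validates tokens proposed by the small model, can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding under stated assumptions, not the paper's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the large and small models, the verification uses exact greedy matching rather than any sampling-based acceptance rule, and the Tandem-specific step of letting the small model attend to the large model's representations is omitted.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def toy_target(tokens: List[int]) -> int:
    """Hypothetical stand-in for the large model (deterministic toy rule)."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical stand-in for the small model: agrees with the target
    except when the prefix length is a multiple of 3."""
    guess = toy_target(tokens)
    return (guess + 1) % 50 if len(tokens) % 3 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], num_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    draft: Model, target: Model, prompt: List[int], num_new: int, block_size: int = 4
) -> Tuple[List[int], int]:
    """Draft a block with the small model, then verify it with the large model.

    Returns the decoded tokens and the number of verification rounds.
    """
    tokens = list(prompt)
    goal = len(prompt) + num_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The draft model proposes up to block_size tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(block_size, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks the block. A real system would score all
        #    positions in one block-mode forward pass; this toy calls it per position.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens, rounds


if __name__ == "__main__":
    out, rounds = speculative_decode(toy_draft, toy_target, [1, 2, 3], num_new=12)
    print(out == greedy_decode(toy_target, [1, 2, 3], 12), rounds)
```

Because every emitted token is either confirmed or supplied by the target model, the output matches what the target would produce on its own under greedy decoding. Any speedup comes from needing fewer verification rounds than generated tokens, which is why a more accurate draft model helps.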
