What Layers When: Learning to Skip Compute in LLMs with Residual Gates

We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream. During inference, we rank tokens by their gate values and skip low-importance ones using a per-layer budget. Whereas early-exit and router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. This tradeoff improves substantially as model size grows. On instruction-tuned models, we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
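The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: the gate is computed from the branch input (so a score is available before the branch runs), the gate is a per-feature vector whose mean serves as the token's importance score, and the per-layer budget is expressed as a `keep_ratio`. The function name `gated_branch_step` and the toy token-wise MLP branch are hypothetical. An attention branch would additionally need a policy for the keys and values of skipped tokens, which the sketch does not cover.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_branch_step(h, branch, W_g, b_g, keep_ratio):
    """One gated residual branch with budgeted token skipping (illustrative).

    h          : (T, d) residual-stream states for T tokens
    branch     : callable mapping (n, d) -> (n, d), e.g. a token-wise MLP
    W_g, b_g   : parameters of the sigmoid-linear gate (assumed shapes (d, d), (d,))
    keep_ratio : fraction of tokens allowed to run the branch in this layer
    """
    # Assumption: the gate reads the branch input, not the branch output.
    gate = sigmoid(h @ W_g + b_g)              # (T, d), values in (0, 1)
    scores = gate.mean(axis=1)                 # (T,) one importance score per token

    # Per-layer budget: keep the top-k tokens by score, at least one.
    num_tokens = h.shape[0]
    k = max(1, int(np.ceil(keep_ratio * num_tokens)))
    keep = np.sort(np.argsort(-scores, kind="stable")[:k])

    # Skipped tokens pass through the residual stream unchanged;
    # kept tokens add the gated branch output.
    out = h.copy()
    out[keep] = h[keep] + gate[keep] * branch(h[keep])
    return out, keep

# Usage with random toy weights (all values here are placeholders).
rng = np.random.default_rng(0)
T, d = 6, 4
h = rng.normal(size=(T, d))
W_g = rng.normal(size=(d, d)) * 0.1
b_g = np.zeros(d)
W_mlp = rng.normal(size=(d, d)) * 0.1

def toy_mlp(x):
    return np.tanh(x @ W_mlp)

out, kept = gated_branch_step(h, toy_mlp, W_g, b_g, keep_ratio=0.5)
print("tokens that ran the branch:", kept)
```

With `keep_ratio=1.0` every token runs the branch and the layer reduces to a gated residual update, which corresponds to the "full compute" setting mentioned in the abstract. Lower ratios trade compute for accuracy.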

