While transformer models have been highly successful, they are computationally inefficient. We observe that, in each layer, the full width of the layer may be needed only for a small subset of tokens in a batch, and that the "effective" width needed to process a token can vary from layer to layer. Motivated by this observation, we introduce the Adaptive Computation Module (ACM), a generic module that dynamically adapts its computational load to the estimated difficulty of the input on a per-token basis. An ACM consists of a sequence of learners that progressively refine the output of their preceding counterparts. An additional gating mechanism determines the optimal number of learners to execute for each token. We also propose a distillation technique to replace any pre-trained model with an "ACMized" variant. Our evaluation of transformer models in computer vision and speech recognition demonstrates that substituting layers with ACMs significantly reduces inference costs without degrading downstream accuracy across a wide range of user-defined computational budgets.
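The per-token control flow described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the names `acm_forward`, `learners`, and `gate` are hypothetical, each learner is a plain callable whose outputs are summed as progressive refinements, and the gate is a stand-in for the learned gating mechanism that maps a token to a learner count.

```python
# Toy sketch of an Adaptive Computation Module (ACM) forward pass.
# Assumption: each "learner" contributes an additive refinement to the
# running output, and a gate decides how many learners a token needs.

def acm_forward(x, learners, gate):
    """Execute only as many learners as the gate deems necessary for x.

    `gate(x)` returns a learner count in [1, len(learners)]; later
    learners refine the output accumulated by earlier ones.
    """
    n = gate(x)                    # estimated difficulty -> learner count
    out = 0.0
    for learner in learners[:n]:   # skip the remaining learners entirely
        out += learner(x)          # progressive, residual-style refinement
    return out, n

# Illustrative setup: an "easy" token (small magnitude) triggers one
# learner; a "hard" token triggers all three.
learners = [lambda x: 0.5 * x, lambda x: 0.3 * x, lambda x: 0.2 * x]
gate = lambda x: 1 if abs(x) < 1.0 else 3

easy_out, easy_n = acm_forward(0.5, learners, gate)  # 1 learner executed
hard_out, hard_n = acm_forward(4.0, learners, gate)  # 3 learners executed
```

The saving comes from the early exit in the loop: tokens the gate judges easy pay for a fraction of the module's width, while hard tokens still receive the full computation.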