Encoder-decoder transformer models have achieved great success on various vision-language (VL) tasks, but they suffer from high inference latency. The decoder typically accounts for most of this latency because decoding is auto-regressive. To accelerate inference, we propose Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model that is trained with deep supervision, so that each of its decoder layers is capable of generating plausible predictions. In addition, we leverage simple yet practical techniques, including a shared generation head and adaptation modules, to maintain accuracy when exiting at shallow decoder layers. Based on the multi-exit model, we perform step-level dynamic early exit during inference: at each individual decoding step, the model may decide to use fewer decoder layers based on its confidence in the current layer's prediction. Because different numbers of decoder layers may be used at different decoding steps, we compute the deeper-layer decoder features of previous decoding steps just-in-time, which ensures that the features from different decoding steps are semantically aligned. We evaluate our approach with two state-of-the-art encoder-decoder transformer models on various VL tasks. We show that our approach can reduce overall inference latency by 30% to 60% with comparable or even higher accuracy compared to the baselines.
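The step-level early exit and the just-in-time feature computation described above can be illustrated with a toy sketch. This is not the DEED implementation: the layer function, the one-hot embedding, the shared head, the use of maximum softmax probability as the confidence measure, and the names `decode`, `layer_forward`, and `threshold` are all assumptions made for illustration, and the adaptation modules are omitted.

```python
import math

NUM_LAYERS = 4   # assumed toy depth
VOCAB = 5        # assumed toy vocabulary size


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def embed(token):
    # Toy one-hot embedding (assumption, stands in for a learned embedding).
    return [1.0 if i == token else 0.0 for i in range(VOCAB)]


def layer_forward(layer, x, context):
    # Toy stand-in for a decoder layer: mixes the current input with the mean
    # of the layer inputs at all steps so far (a crude proxy for attention).
    mean = [sum(c[i] for c in context) / len(context) for i in range(VOCAB)]
    gain = 1.0 + layer
    return [math.tanh(gain * (x[(i - 1) % VOCAB] + 0.5 * mean[i]))
            for i in range(VOCAB)]


def shared_head(hidden):
    # One generation head shared by every exit (toy: a fixed scaling).
    return [3.0 * v for v in hidden]


def decode(first_token, num_steps, threshold):
    # feats[l][s] caches the output of layer l at step s; l = 0 is the embedding.
    feats = [dict() for _ in range(NUM_LAYERS + 1)]
    tokens, depths = [first_token], []

    def feature(layer, step):
        # Just-in-time: compute a missing feature only when a deeper layer of a
        # later step needs it, recursing into shallower layers as required.
        if step not in feats[layer]:
            if layer == 0:
                feats[0][step] = embed(tokens[step])
            else:
                context = [feature(layer - 1, s) for s in range(step + 1)]
                feats[layer][step] = layer_forward(layer, context[-1], context)
        return feats[layer][step]

    for step in range(num_steps):
        for layer in range(1, NUM_LAYERS + 1):
            probs = softmax(shared_head(feature(layer, step)))
            confidence = max(probs)  # assumed confidence measure
            if confidence >= threshold or layer == NUM_LAYERS:
                break  # exit at this layer for the current step
        tokens.append(probs.index(confidence))
        depths.append(layer)
    return tokens[1:], depths
```

Calling `decode(0, 6, 0.9)` returns the generated token ids together with the number of decoder layers used at each step. In this sketch, a threshold above 1 can never be met, so decoding falls back to using all layers at every step, while a threshold of 0 exits after the first layer every time.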