You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

Large-scale Transformer models bring significant improvements to various downstream vision language tasks with a unified architecture. These performance improvements come with increasing model size, resulting in slow inference speed and increased serving cost. While certain predictions benefit from the full complexity of the large-scale model, not all inputs need the same amount of computation, potentially leading to wasted computation resources. To handle this challenge, early exiting has been proposed to adaptively allocate computational power in terms of input complexity to improve inference efficiency. Existing early exiting strategies usually adopt output confidence based on intermediate layers as a proxy for input complexity to decide whether to skip the following layers. However, such strategies cannot be applied to the encoder in the widely used unified architecture with both encoder and decoder, due to the difficulty of estimating output confidence in the encoder. Ignoring early exiting in the encoder component is suboptimal in terms of saving computation. To handle this challenge, we propose a novel early exiting strategy for unified vision language models, which allows dynamically skipping layers in the encoder and decoder simultaneously in terms of layer-wise input similarities with multiple early exits, namely \textbf{MuE}. By decomposing the image and text modalities in the encoder, MuE is flexible and can skip different layers for each modality, improving inference efficiency while minimizing the performance drop. Experiments on the SNLI-VE and MS COCO datasets show that the proposed approach MuE can reduce expected inference time by up to 50\% and 40\% while maintaining 99\% and 96\% performance, respectively.
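As a rough illustration of similarity-based exiting, the sketch below stops a layer stack once consecutive hidden states become nearly identical. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify the similarity measure or the exit rule, so the use of cosine similarity, the fixed `threshold`, and the names `cosine_similarity` and `run_with_early_exit` are all hypothetical. Hidden states are plain lists of floats and layers are plain callables, purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_with_early_exit(layers, hidden, threshold):
    """Apply layers in order; exit once a layer barely changes its input.

    Returns the final hidden state and the number of layers executed.
    The exit rule (similarity >= threshold) is an assumption of this sketch.
    """
    executed = 0
    for layer in layers:
        new_hidden = layer(hidden)
        executed += 1
        similarity = cosine_similarity(hidden, new_hidden)
        hidden = new_hidden
        if similarity >= threshold:
            break  # representation has saturated; skip the remaining layers
    return hidden, executed


# Toy layers standing in for Transformer blocks (hypothetical).
def swap(h):
    return [h[1], h[0]]


def identity(h):
    return list(h)


# Each modality runs through its own stack and may exit at a different depth.
image_out, image_depth = run_with_early_exit([swap, identity, swap], [1.0, 0.0], 0.99)
text_out, text_depth = run_with_early_exit([swap, swap, swap], [1.0, 0.0], 0.99)
```

In this toy run the image stream exits after its second layer, because that layer leaves the hidden state unchanged, while the text stream runs all three layers. Handling the two modalities separately is what lets each stop at its own depth; the same check could in principle be applied again in the decoder.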
