Vision transformers have achieved leading performance on various visual tasks, yet they still suffer from high computational complexity. The problem worsens in dense prediction tasks such as semantic segmentation, because high-resolution inputs and outputs usually mean more tokens are involved in computation. Directly removing the less attentive tokens has been explored for image classification but cannot be extended to semantic segmentation, since a dense prediction is required for every patch. To this end, this work introduces a Dynamic Token Pruning (DToP) method for semantic segmentation based on the early exit of tokens. Motivated by the coarse-to-fine way humans segment images, we split the widely adopted auxiliary-loss-based network architecture into several stages, where each auxiliary block grades every token's difficulty level. The predictions of easy tokens can then be finalized in advance, without completing the entire forward pass. Moreover, we keep the $k$ highest-confidence tokens for each semantic category to preserve representative context information. Computational complexity therefore varies with the difficulty of the input, akin to the way humans do segmentation. Experiments suggest that the proposed DToP architecture reduces the computational cost of current semantic segmentation methods based on plain vision transformers by $20\%$ to $35\%$ on average without accuracy degradation.
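The early-exit rule described in the abstract can be sketched for a single auxiliary stage as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `split_tokens`, the use of the maximum class probability as the difficulty score, the `threshold` value of 0.95, and the choice to leave retained context tokens unfinalized at this stage are all hypothetical and are not specified in the abstract.

```python
def split_tokens(probs, threshold=0.95, k=1):
    """Decide which tokens exit early at one auxiliary stage (illustrative).

    probs: list of per-token class-probability lists, one row per token,
           as an auxiliary block might produce them (assumed input format).
    threshold: assumed confidence level above which a token counts as easy.
    k: number of highest-confidence easy tokens retained per class.

    Returns (keep, exited):
      keep   - sorted indices of tokens that continue to later layers
      exited - dict mapping token index -> finalized class label
    """
    conf = [max(p) for p in probs]            # confidence of each token
    pred = [p.index(max(p)) for p in probs]   # predicted class of each token
    easy = {i for i, c in enumerate(conf) if c >= threshold}

    # Group easy tokens by predicted class.
    by_class = {}
    for i in sorted(easy):
        by_class.setdefault(pred[i], []).append(i)

    # Retain the k most confident easy tokens of each class as context.
    retained = set()
    for members in by_class.values():
        members.sort(key=lambda i: conf[i], reverse=True)
        retained.update(members[:k])

    # All other easy tokens have their prediction finalized and are pruned.
    exited = {i: pred[i] for i in sorted(easy - retained)}
    keep = [i for i in range(len(probs)) if i not in exited]
    return keep, exited


# Toy input: four tokens, three classes.
probs = [
    [0.98, 0.01, 0.01],  # easy, class 0 (most confident of its class)
    [0.97, 0.02, 0.01],  # easy, class 0
    [0.40, 0.35, 0.25],  # hard
    [0.02, 0.96, 0.02],  # easy, class 1 (only easy token of its class)
]
keep, exited = split_tokens(probs, threshold=0.95, k=1)
print(keep, exited)  # [0, 2, 3] {1: 0}
```

In this toy input only token 1 is pruned. Token 2 is hard and continues, while tokens 0 and 3 continue as the retained context for their classes. The number of tokens passed to later layers therefore depends on how many tokens the input makes easy, which is the input-dependent cost the abstract describes.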