Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks

From arXiv. Accepted to T-PAMI. Journal version of our NeurIPS 2021 work (arXiv:2106.02034). Code is available at https://github.com/raoyongming/DynamicViT

In this paper, we present a new approach for model acceleration that exploits spatial sparsity in visual data. We observe that the final prediction in vision Transformers is based on only a subset of the most informative tokens, and that this subset is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically, conditioned on the input, to accelerate vision Transformers. Specifically, we devise a lightweight prediction module that estimates the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of sparse attention in vision Transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as to more complex dense prediction tasks that require structured feature maps. To do so, we formulate a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and more expressive slow paths to more important locations, we maintain the structure of feature maps while significantly reducing the overall computation. Extensive experiments demonstrate the effectiveness of our framework on various modern architectures and different visual recognition tasks. Our results demonstrate that dynamic spatial sparsification offers a new and more effective dimension for model acceleration.
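
The two ideas in the abstract, progressive token pruning and asymmetric fast/slow computation, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the linked repository for that). It assumes a hypothetical linear scorer in place of the paper's lightweight prediction module, a hard top-k selection with a fixed keep ratio per stage, and made-up helper names (`score_tokens`, `keep_top_indices`, `progressive_prune`, `asymmetric_update`).

```python
import math


def score_tokens(tokens, weights):
    """Hypothetical scorer: a dot product gives one importance score per token."""
    return [sum(w * x for w, x in zip(weights, tok)) for tok in tokens]


def keep_top_indices(scores, keep_ratio):
    """Return the indices of the highest-scoring entries, in original order."""
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def progressive_prune(tokens, weights_per_stage, keep_ratio):
    """Prune stage by stage; each stage scores only the surviving tokens.

    Returns the indices (into the original token list) that survive all stages.
    """
    surviving = list(range(len(tokens)))
    for weights in weights_per_stage:
        current = [tokens[i] for i in surviving]
        kept = keep_top_indices(score_tokens(current, weights), keep_ratio)
        surviving = [surviving[j] for j in kept]
    return surviving


def asymmetric_update(tokens, scores, keep_ratio, slow_path, fast_path):
    """Send important locations through slow_path and the rest through fast_path.

    No location is dropped, so the output has the same layout as the input,
    which is what dense prediction tasks need.
    """
    important = set(keep_top_indices(scores, keep_ratio))
    return [
        slow_path(tok) if i in important else fast_path(tok)
        for i, tok in enumerate(tokens)
    ]


if __name__ == "__main__":
    # 16 toy tokens with 2 features each; 3 pruning stages at a 0.7 keep ratio.
    toy_tokens = [[float(i), float(i % 3)] for i in range(16)]
    stage_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    survivors = progressive_prune(toy_tokens, stage_weights, keep_ratio=0.7)
    print(len(toy_tokens), "tokens ->", len(survivors), "tokens")
```

With a keep ratio of 0.7 applied at three stages, the token count shrinks geometrically, so later layers process well under half of the original tokens. In `asymmetric_update`, by contrast, the output keeps one entry per input location and only the cost per location differs.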
