High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, which hinders their use in latency-sensitive applications. Because not all pixels are equally important, skipping computation for less important regions is a simple and effective way to reduce the cost. This, however, is hard to translate into actual speedup for CNNs, since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs). Because window attention is naturally batched over blocks, actual speedup from window activation pruning becomes possible: for example, ~50% latency reduction at 60% sparsity. Different layers should be assigned different pruning ratios because they differ in sensitivity and computational cost. We introduce sparsity-aware adaptation and apply evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x over its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
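The idea of window activation pruning can be illustrated with a minimal sketch. It is not the SparseViT implementation: the abstract does not say how window importance is measured, so the L2 activation magnitude used here is an assumption, as are the pass-through of skipped windows and all function names (`window_scores`, `prune_windows`, `window_op`, `sparse_layer`), which are hypothetical. The point it shows is that the kept windows form a smaller but still dense batch, so the per-window operation runs on a regular workload.

```python
import math


def window_scores(windows):
    """Score each window by the L2 magnitude of its activations.

    Assumption: the abstract does not specify the importance measure.
    """
    return [math.sqrt(sum(x * x for tok in w for x in tok)) for w in windows]


def prune_windows(windows, sparsity):
    """Keep the highest-scoring windows; return (kept_indices, kept_windows)."""
    n_keep = max(1, round(len(windows) * (1.0 - sparsity)))
    scores = window_scores(windows)
    ranked = sorted(range(len(windows)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original window order
    return kept, [windows[i] for i in kept]


def window_op(window):
    """Stand-in for per-window attention: subtract the per-channel mean."""
    n_tokens = len(window)
    n_channels = len(window[0])
    means = [sum(tok[c] for tok in window) / n_tokens for c in range(n_channels)]
    return [[tok[c] - means[c] for c in range(n_channels)] for tok in window]


def sparse_layer(windows, sparsity):
    """Apply window_op only to the kept windows.

    Assumption: skipped windows pass through unchanged.
    """
    kept, dense_batch = prune_windows(windows, sparsity)
    # The kept windows are a smaller dense batch, so the work stays regular.
    processed = [window_op(w) for w in dense_batch]
    out = list(windows)
    for i, w in zip(kept, processed):
        out[i] = w
    return out, kept


if __name__ == "__main__":
    # 10 windows, 4 tokens each, 2 channels; window i holds the value i + 1.
    windows = [[[float(i + 1)] * 2 for _ in range(4)] for i in range(10)]
    out, kept = sparse_layer(windows, sparsity=0.6)
    print("windows processed:", kept)
```

At 60% sparsity only 4 of the 10 windows reach `window_op`. In a real window-based ViT the same gather, process, scatter pattern would be applied to batched tensors, and the per-layer `sparsity` values would be the quantities chosen by the layerwise search.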