Human visual recognition is a sparse process: only a few salient visual cues are attended to, rather than every detail being traversed uniformly. However, most current vision networks follow a dense paradigm, processing every visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner. SparseFormer learns to represent images using a very limited number of tokens (as few as 49) in the latent space with a sparse feature sampling procedure, instead of processing dense units in the original pixel space. SparseFormer therefore circumvents most dense operations on the image space and has much lower computational cost. Experiments on the ImageNet classification benchmark show that SparseFormer achieves performance on par with canonical or well-established models while offering a better accuracy-throughput tradeoff. Moreover, the design of our network can be easily extended to video classification, with promising performance at lower computational cost. We hope that our work can provide an alternative way for visual modeling and inspire further research on sparse neural architectures. The code will be publicly available at https://github.com/showlab/sparseformer
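To make the idea of sparse feature sampling concrete, the following is a minimal sketch and not the SparseFormer implementation. It assumes that each of 49 tokens owns a rectangular region of interest (RoI) and reads only a few bilinearly interpolated points inside it. The 7 x 7 grid initialisation, the randomly placed sampling points (a stand-in for whatever a trained model would choose), the single-channel toy image, and all function names are hypothetical choices made for illustration.

```python
import random


def bilinear_sample(image, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at continuous pixel coordinates (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp so that every sampling point stays inside the image.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def grid_rois(height, width, side=7):
    """Hypothetical initialisation: side * side RoIs (cx, cy, w, h) tiling the image."""
    rw, rh = width / side, height / side
    return [((j + 0.5) * rw, (i + 0.5) * rh, rw, rh)
            for i in range(side) for j in range(side)]


def sample_token_features(image, rois, points_per_token=4, seed=0):
    """For each token RoI, read a handful of points instead of every pixel."""
    rng = random.Random(seed)
    tokens = []
    for cx, cy, rw, rh in rois:
        feats = []
        for _ in range(points_per_token):
            # Random offsets inside the RoI; an assumption standing in for learned offsets.
            px = cx + (rng.random() - 0.5) * rw
            py = cy + (rng.random() - 0.5) * rh
            feats.append(bilinear_sample(image, px, py))
        tokens.append(feats)
    return tokens


if __name__ == "__main__":
    H = W = 224
    # Toy single-channel image with values in [0, 1].
    image = [[(x + y) / (H + W - 2) for x in range(W)] for y in range(H)]
    tokens = sample_token_features(image, grid_rois(H, W))
    reads = len(tokens) * len(tokens[0])
    print(len(tokens), "tokens,", reads, "sampled points vs", H * W, "pixels")
```

In this toy setting the tokens touch only a few hundred image locations rather than all 224 x 224 pixels, which is the kind of saving the abstract attributes to avoiding dense operations on the image space.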