Despite their simpler information fusion designs compared with Vision Transformers and Convolutional Neural Networks, Vision MLP architectures have demonstrated strong performance and high data efficiency in recent research. However, existing works such as CycleMLP and Vision Permutator typically model spatial information in equal-size spatial regions and do not consider cross-scale spatial interactions. Further, their token mixers only model 1- or 2-axis correlations, avoiding 3-axis spatial-channel mixing because of its computational cost. We therefore propose CS-Mixer, a hierarchical Vision MLP that learns dynamic low-rank transformations for spatial-channel mixing through cross-scale local and global aggregation. The proposed method achieves competitive results on popular image recognition benchmarks without incurring substantially more compute. Our largest model, CS-Mixer-L, reaches 83.2% top-1 accuracy on ImageNet-1k with 13.7 GFLOPs and 94M parameters.
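To make the cost argument concrete, the following is a minimal sketch of why a low-rank factorization makes 3-axis (height, width, channel) mixing affordable. It is not the CS-Mixer token mixer: the weights here are fixed random matrices rather than dynamically predicted ones, there is no cross-scale aggregation, and the function names, tensor sizes, and rank are illustrative assumptions. A dense map over all n = H·W·C entries needs n² weights, whereas a rank-r factorization U Vᵀ needs 2·n·r.

```python
import numpy as np


def full_mixing_params(h, w, c):
    """Weights in a dense linear map over all h*w*c entries (3-axis mixing)."""
    n = h * w * c
    return n * n


def low_rank_mixing_params(h, w, c, rank):
    """Weights in a rank-`rank` factorization U @ V.T of the same map."""
    n = h * w * c
    return 2 * n * rank


def low_rank_mix(x, u, v):
    """Apply y = U (V^T x) to x of shape (h, w, c) without forming U @ V.T.

    u and v have shape (h*w*c, rank). They are static here; the abstract
    describes dynamic (input-dependent) transformations, which this sketch
    does not attempt to reproduce.
    """
    h, w, c = x.shape
    flat = x.reshape(-1)            # flatten all three axes into one vector
    mixed = u @ (v.T @ flat)        # costs O(n * rank) instead of O(n^2)
    return mixed.reshape(h, w, c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, c, rank = 7, 7, 64, 4     # hypothetical feature-map size and rank
    n = h * w * c
    x = rng.standard_normal((h, w, c))
    u = rng.standard_normal((n, rank))
    v = rng.standard_normal((n, rank))
    y = low_rank_mix(x, u, v)
    print("output shape:", y.shape)
    print("dense weights:", full_mixing_params(h, w, c))
    print("low-rank weights:", low_rank_mixing_params(h, w, c, rank))
```

Evaluating Vᵀx first keeps the intermediate at size r, so the dense n × n matrix is never materialized; this is the usual reason low-rank maps are preferred when every output entry should depend on every spatial position and channel.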