Modeling in computer vision has evolved to MLPs. Vision MLPs inherently lack local modeling capability, and the simplest remedy is to combine them with convolutional layers. Convolution, known for its sliding-window scheme, also inherits that scheme's redundancy and low computational efficiency. In this paper, we seek to dispense with the windowing scheme and introduce a more elaborate and effective approach to exploiting locality. To this end, we propose a new MLP module, namely Shifted-Pillars-Concatenation (SPC), that consists of two steps: (1) Pillars-Shift, which generates four neighboring maps by shifting the input image along four directions, and (2) Pillars-Concatenation, which applies linear transformations and concatenation to the maps to aggregate local features. The SPC module offers superior local modeling power and performance gains, making it a promising alternative to the convolutional layer. We then build a pure-MLP architecture called Caterpillar by replacing the convolutional layer with the SPC module in the hybrid model sMLPNet. Extensive experiments show that Caterpillar achieves excellent performance and scalability on both ImageNet-1K and small-scale classification benchmarks.
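The two SPC steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the one-pixel shift, the zero filling of vacated positions, the (H, W, C) layout, the reduction of each map to C/4 channels, and the names `pillars_shift`, `pillars_concat`, and `spc` are all assumptions made for illustration.

```python
import numpy as np

def pillars_shift(x, step=1):
    """Return four copies of x (H, W, C), each shifted by `step` pixels
    in one direction. Vacated positions are zero-filled (an assumption)."""
    h, w, _ = x.shape
    up = np.zeros_like(x)
    up[:h - step] = x[step:]            # content moves up
    down = np.zeros_like(x)
    down[step:] = x[:h - step]          # content moves down
    left = np.zeros_like(x)
    left[:, :w - step] = x[:, step:]    # content moves left
    right = np.zeros_like(x)
    right[:, step:] = x[:, :w - step]   # content moves right
    return [up, down, left, right]

def pillars_concat(maps, weights):
    """Apply one per-pixel linear map to each shifted copy, then
    concatenate the results along the channel axis."""
    projected = [m @ w for m, w in zip(maps, weights)]
    return np.concatenate(projected, axis=-1)

def spc(x, weights, step=1):
    """Hypothetical SPC block: Pillars-Shift followed by Pillars-Concatenation."""
    return pillars_concat(pillars_shift(x, step), weights)

# Usage: 16 input channels, each of the four maps projected to 4 channels
# (an assumed split), so the output keeps 16 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
weights = [rng.standard_normal((16, 4)) for _ in range(4)]
y = spc(x, weights)
print(y.shape)  # (8, 8, 16)
```

In this sketch each output pixel mixes features from its four immediate neighbors using only per-pixel linear maps, with no sliding window involved.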