In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this gap, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding (RPE) to build pairwise token relations, leveraging small parameterized relative position bias tables to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the positional gating unit of the image-level PosMLP into temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be flexibly combined into three types of spatio-temporally factorized positional MLP blocks, which not only reduce model complexity but also maintain strong performance. Additionally, we enrich the relative positional relationships via channel grouping. Experimental results on three video-related tasks demonstrate that PosMLP-Video achieves competitive speed-accuracy trade-offs compared with previous state-of-the-art models. In particular, PosMLP-Video pre-trained on ImageNet1K achieves 59.0%/70.3% top-1 accuracy on Something-Something V1/V2 and 82.1% top-1 accuracy on Kinetics-400, while requiring far fewer parameters and FLOPs than other models. The code is released at https://github.com/zhouds1918/PosMLP_Video.
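The core idea — replacing dense attention with relation scores read from a small learnable relative-position bias table, then using them to gate tokens — can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification for a 1-D token sequence, not the paper's actual PoTGU/PoSGU/PoSTGU implementation; the function names and shapes here are hypothetical.

```python
import numpy as np

def relative_position_matrix(n, bias_table):
    """Build an (n, n) token-relation matrix by indexing a learnable
    bias table of size 2n-1 with the relative position i - j."""
    idx = np.arange(n)[:, None] - np.arange(n)[None, :] + (n - 1)
    return bias_table[idx]

def positional_gating_unit(x, bias_table):
    """Hypothetical sketch of a positional gating unit.
    x: (n, c) tokens; bias_table: (2n-1,) relative-position biases.
    Splits channels into two halves and gates one half with a
    positional mixing of the other — no query-key attention."""
    n, c = x.shape
    u, v = x[:, : c // 2], x[:, c // 2 :]
    rel = relative_position_matrix(n, bias_table)  # (n, n) relation scores
    return u * (rel @ v)                           # element-wise gating
```

With a bias table that is 1 at relative position 0 and 0 elsewhere, the relation matrix is the identity, so the unit reduces to plain channel-wise gating `u * v` — a sanity check that the mixing is driven entirely by the bias table.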