Recently, advances in self-supervised learning techniques such as masked autoencoders (MAE) have greatly influenced visual representation learning for images and videos. Nevertheless, the predominant approaches to masked image / video modeling rely heavily on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach termed \textbf{VideoMAC}, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC applies symmetric masking to randomly sampled pairs of video frames. To prevent the dissipation of mask patterns, we implement the ConvNet encoders with sparse convolutional operators. At the same time, we present a simple yet effective masked video modeling (MVM) approach: a dual-encoder architecture comprising an online encoder and an exponential moving average (EMA) target encoder, designed to encourage inter-frame reconstruction consistency in videos. Furthermore, we show that VideoMAC, which enables classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+\textbf{5.2\%} / \textbf{6.4\%} $\mathcal{J}\&\mathcal{F}$), body part propagation (+\textbf{6.3\%} / \textbf{3.1\%} mIoU), and human pose tracking (+\textbf{10.2\%} / \textbf{11.1\%} [email protected]).
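The two mechanisms named in the abstract, symmetric masking of a frame pair and the EMA update of the target encoder, can be sketched as follows. This is a minimal illustration in NumPy, not the paper's implementation: the function names, the mask ratio of 0.75, and the momentum of 0.996 are assumptions chosen for the example.

```python
import numpy as np

def symmetric_mask(h, w, mask_ratio=0.75, seed=0):
    """Sample one random patch-level mask and reuse it for both
    frames of a sampled pair (symmetric masking). The 0.75 ratio
    is illustrative, not taken from the paper."""
    rng = np.random.default_rng(seed)
    n = h * w
    n_masked = int(n * mask_ratio)
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_masked]] = True  # True = masked patch
    return mask.reshape(h, w)

def ema_update(target_params, online_params, momentum=0.996):
    """Update the target encoder as an exponential moving average of
    the online encoder (dual-encoder MVM scheme). Parameters are
    represented as a flat dict of arrays/scalars for illustration."""
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * online_params[name]
        for name in target_params
    }

# Usage: one mask is shared by both frames of the pair, and the
# target encoder drifts slowly toward the online encoder.
mask = symmetric_mask(14, 14)          # same mask applied to frame t and t'
target = {"w": np.float64(1.0)}
online = {"w": np.float64(0.0)}
target = ema_update(target, online)    # w moves from 1.0 toward 0.0
```

Because the same mask is applied to both frames, the online and target encoders see identically masked inputs, which is what allows the inter-frame reconstruction-consistency objective to compare like with like.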