Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. By masking a large fraction ($95\%$) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations. Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.
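The asymmetric masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `asymmetric_mask`, the `seed` parameter, and the choice of 196 patches per frame (a 14 x 14 grid) are assumptions made for illustration only. The sketch only selects patch indices; the encoder and the cross-attention decoder are omitted.

```python
import random


def asymmetric_mask(num_patches, mask_ratio=0.95, seed=None):
    """Pick visible patch indices for a (past, future) frame pair.

    Hypothetical helper: the past frame keeps every patch, while the
    future frame keeps only a random (1 - mask_ratio) fraction.
    """
    rng = random.Random(seed)
    # Number of future-frame patches left visible (at least one).
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))

    # Shuffle patch indices and split into visible and masked sets.
    perm = list(range(num_patches))
    rng.shuffle(perm)
    visible_future = sorted(perm[:num_keep])
    masked_future = sorted(perm[num_keep:])  # reconstruction targets

    # The past frame is left unmasked.
    visible_past = list(range(num_patches))
    return visible_past, visible_future, masked_future


# Assumed example: 196 patches per frame, 95% of the future frame masked.
past_ids, future_ids, target_ids = asymmetric_mask(196, seed=0)
print(len(past_ids), len(future_ids), len(target_ids))  # 196 10 186
```

With these assumed sizes, the encoder would see all 196 patches of the past frame but only 10 patches of the future frame, so the remaining 186 patches can only be predicted by relating them to content in the past frame.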