Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the structural dependencies inherent in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos, and that simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Previous self-supervised VOS techniques mostly rely on auxiliary modalities or on iterative slot attention to assist object discovery, which restricts their general applicability and raises their computational cost. To address these challenges, we develop a simplified architecture that capitalizes on the objectness emerging from DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Specifically, we first introduce a single spatio-temporal Transformer block that processes the frame-wise DINO features and establishes spatio-temporal dependencies in the form of self-attention. Using the resulting attention maps, we then apply hierarchical clustering to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method achieves state-of-the-art performance across multiple unsupervised VOS benchmarks and particularly excels in complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19. The code and model checkpoints will be released at https://github.com/shvdiwnkozbw/SSL-UVOS.
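
The two inference steps described in the abstract (spatio-temporal self-attention over frame-wise features, then hierarchical clustering on the attention maps) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and every name in it is hypothetical: `toy_features` stands in for real DINO patch features, identity query/key projections stand in for the trained spatio-temporal block, and average-linkage agglomerative clustering on cosine similarity of attention rows is one possible choice of hierarchical clustering, assumed here because the abstract does not specify the variant.

```python
import numpy as np

def toy_features(T=2, N=8, D=16, seed=0):
    """Stand-in for DINO patch features (assumption): two 'objects' per frame."""
    rng = np.random.default_rng(seed)
    feats = 0.01 * rng.standard_normal((T, N, D))
    feats[:, : N // 2, 0] += 4.0   # patches of object A share one direction
    feats[:, N // 2 :, 1] += 4.0   # patches of object B share another
    return feats

def spatiotemporal_attention(feats, Wq=None, Wk=None):
    """Single-head self-attention over all T*N tokens of a clip.

    Wq and Wk are hypothetical projection matrices; identity is used when
    they are omitted, whereas a trained block would learn them.
    """
    T, N, D = feats.shape
    tokens = feats.reshape(T * N, D)
    q = tokens if Wq is None else tokens @ Wq
    k = tokens if Wk is None else tokens @ Wk
    logits = q @ k.T / np.sqrt(D)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    return attn / attn.sum(axis=1, keepdims=True)  # rows sum to 1

def agglomerative_cluster(attn, num_clusters):
    """Average-linkage merging of tokens by cosine similarity of attention rows."""
    n = attn.shape[0]
    rows = attn / np.linalg.norm(attn, axis=1, keepdims=True)
    sim = rows @ rows.T
    clusters = [[i] for i in range(n)]
    while len(clusters) > num_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sim[np.ix_(clusters[a], clusters[b])].mean()
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for idx, members in enumerate(clusters):
        labels[members] = idx
    return labels

feats = toy_features()
attn = spatiotemporal_attention(feats)
T, N, _ = feats.shape
masks = agglomerative_cluster(attn, num_clusters=2).reshape(T, N)
```

Here `masks` holds one cluster label per patch per frame. Because the attention is computed jointly over all frames, patches of the same toy object receive the same label across time without any explicit tracking step, which is the property the approach relies on. The pairwise merge loop is written for readability and only suits small token counts.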
