Unsupervised localization and segmentation are long-standing challenges in robot vision: an autonomous robot must learn to decompose images into individual objects without labeled data. These tasks matter because dense manual image annotation is scarce and because lifelong learning requires adapting to an evolving set of object categories. Most recent methods use continuity of visual appearance as the object cue, spatially clustering features obtained from self-supervised vision transformers (ViTs). In this work, we leverage motion cues, inspired by the common fate principle: pixels that share similar movements tend to belong to the same object. We propose a new loss term that uses optical flow in unlabeled videos to encourage self-supervised ViT features to move closer together when their corresponding spatial locations share similar movements, and farther apart when they do not. We use this loss to fine-tune vision transformers that were originally trained on static images. Our fine-tuning procedure, which uses no labeled data, outperforms state-of-the-art techniques for unsupervised semantic segmentation under linear probing. It also improves on the original ViT networks across unsupervised object localization and semantic segmentation benchmarks.
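The idea behind a flow-guided feature loss can be illustrated with a minimal sketch. This is not the formulation proposed in this work, which the abstract does not specify. It assumes one feature vector and one mean optical-flow vector per image patch, a Gaussian kernel on flow differences, and a hypothetical `shift` threshold that separates "similar" from "dissimilar" motion. The names `flow_similarity`, `motion_guided_loss`, `sigma`, and `shift` are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def flow_similarity(u, v, sigma=1.0):
    """Gaussian kernel on the flow difference: 1 for identical motion, near 0 otherwise."""
    d2 = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.exp(-d2 / (sigma ** 2))


def motion_guided_loss(features, flows, sigma=1.0, shift=0.5):
    """Average over patch pairs of -(flow_similarity - shift) * feature cosine.

    Pairs whose flow similarity exceeds `shift` are rewarded for similar
    features; pairs below it are penalized for similar features.
    `shift` is an assumed hyperparameter, not taken from the source.
    """
    n = len(features)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            weight = flow_similarity(flows[i], flows[j], sigma) - shift
            total -= weight * cosine(features[i], features[j])
            pairs += 1
    return total / max(pairs, 1)
```

Under these assumptions, two patches that move together yield a lower loss when their features align, while two patches that move differently yield a lower loss when their features diverge. In practice the features would come from a ViT and the loss would be minimized by gradient descent in an automatic-differentiation framework; the pure-Python version here only shows the pairwise structure.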