Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
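The abstract does not spell out the temporal feature similarity loss, so the following is only a minimal sketch of one plausible reading. It assumes the targets are softmax-normalized cosine similarities between patch features of one frame and a later frame, and that the model is trained with cross-entropy to predict them. The function names, the temperature value, and the choice of cosine similarity and cross-entropy are all illustrative assumptions rather than details taken from the text. The sketch uses plain Python lists for portability; a practical implementation would use a tensor library.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (small epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_targets(feats_t, feats_next, temperature=0.1):
    """Assumed target: for each patch in frame t, a distribution over the
    patches of a later frame, based on feature similarity."""
    return [
        softmax([cosine_similarity(p, q) for q in feats_next], temperature)
        for p in feats_t
    ]

def temporal_similarity_loss(pred_logits, targets):
    """Assumed loss: mean cross-entropy between predicted logits and the
    similarity-based target distributions."""
    total = 0.0
    for logits, target in zip(pred_logits, targets):
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += -sum(t * (l - log_z) for t, l in zip(target, logits))
    return total / len(targets)

# Toy example: two patches whose (hypothetical) pre-trained features swap
# positions between frames, mimicking an object that moved.
feats_t = [[1.0, 0.0], [0.0, 1.0]]
feats_next = [[0.0, 1.0], [1.0, 0.0]]
targets = similarity_targets(feats_t, feats_next)

# Predictions that follow the motion should score lower than ones that ignore it.
loss_following_motion = temporal_similarity_loss([[0.0, 5.0], [5.0, 0.0]], targets)
loss_ignoring_motion = temporal_similarity_loss([[5.0, 0.0], [0.0, 5.0]], targets)
```

In this toy setup each target row puts most of its mass on the location where the matching patch ends up in the later frame, which is the sense in which such a loss could act as a motion bias: to predict the targets well, the model has to track where semantically similar content moves.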