Unsupervised object-centric learning from videos is a promising approach to extracting structured representations from large, unlabeled collections of videos. To support downstream tasks like autonomous control, these representations must be both compositional and temporally consistent. Existing approaches based on recurrent processing often lack long-term stability across frames because their training objective does not enforce temporal consistency. In this work, we introduce a novel object-level temporal contrastive loss for video object-centric models that explicitly promotes temporal consistency. Our method significantly improves the temporal consistency of the learned object-centric representations, yielding more reliable video decompositions that facilitate challenging downstream tasks such as unsupervised object dynamics prediction. Furthermore, the inductive bias added by our loss strongly improves object discovery, leading to state-of-the-art results on both synthetic and real-world datasets, outperforming even weakly supervised methods that leverage motion masks as additional cues.
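To make the object-level temporal contrastive idea concrete, the following is a minimal sketch, not the authors' exact formulation: it assumes an InfoNCE-style objective in which each slot at frame t is pulled toward its corresponding slot at frame t+1 (with correspondence given by slot index, e.g., via recurrent slot initialization) and pushed away from all other slots in the batch. The function name, temperature value, and matching assumption are illustrative.

```python
import torch
import torch.nn.functional as F

def slot_temporal_contrastive_loss(slots_t, slots_t1, temperature=0.1):
    """Sketch of an object-level temporal contrastive loss (hypothetical).

    slots_t, slots_t1: (B, K, D) slot representations from consecutive
    frames, where slot k at frame t is assumed to track the same object
    as slot k at frame t+1.

    Positive pair: the same slot index across frames.
    Negatives: all other slots at frame t+1, within and across videos.
    """
    B, K, D = slots_t.shape
    # L2-normalize so dot products become cosine similarities.
    a = F.normalize(slots_t.reshape(B * K, D), dim=-1)
    b = F.normalize(slots_t1.reshape(B * K, D), dim=-1)
    # Similarity logits between every slot at t and every slot at t+1.
    logits = a @ b.t() / temperature  # (B*K, B*K)
    # The diagonal entries are the positive pairs; everything else
    # serves as a negative.
    targets = torch.arange(B * K, device=slots_t.device)
    return F.cross_entropy(logits, targets)
```

In a training loop, a term like this would typically be added as an auxiliary loss alongside the model's reconstruction objective, so that slots are encouraged to both explain the frame and remain identifiable over time.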