Recent unsupervised multi-object detection models have shown impressive performance improvements, largely attributed to novel architectural inductive biases. Unfortunately, they may produce suboptimal object encodings for downstream tasks. To overcome this, we propose to exploit object motion and continuity, i.e., the fact that objects do not pop in and out of existence. This is accomplished through two mechanisms: (i) providing priors on the location of objects by integrating optical flow, and (ii) a contrastive object continuity loss across consecutive image frames. Rather than being tied to a dedicated deep architecture, the resulting Motion and Object Continuity (MOC) scheme can be instantiated using any baseline object detection model. Our results show large improvements in the performance of a state-of-the-art model in terms of object discovery, convergence speed, and overall latent object representations, particularly for playing Atari games. Overall, we show clear benefits of integrating motion and object continuity for downstream tasks, moving beyond object representation learning based only on reconstruction.
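To make mechanism (ii) more concrete, the following is a minimal sketch of what a contrastive object continuity loss could look like. It is not the loss defined in this work: it assumes an InfoNCE-style formulation, assumes that object slot i in one frame corresponds to slot i in the next frame, and assumes all encodings are non-zero vectors. The function name `continuity_loss` and the `temperature` parameter are hypothetical and chosen for illustration only.

```python
import math


def _normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def continuity_loss(z_t, z_next, temperature=0.1):
    """Illustrative InfoNCE-style continuity loss (an assumption, not the paper's loss).

    z_t and z_next hold one encoding per object for two consecutive frames.
    Slot i in z_t is assumed to be the same object as slot i in z_next.
    The loss is low when each object is most similar to itself in the next frame.
    """
    anchors = [_normalize(v) for v in z_t]
    candidates = [_normalize(v) for v in z_next]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Cosine similarity of this object to every object in the next frame.
        logits = [
            sum(x * y for x, y in zip(anchor, candidate)) / temperature
            for candidate in candidates
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the same-slot encoding as the positive example.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy encodings: three objects that change only slightly between frames.
frame_t = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_next = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]

matched = continuity_loss(frame_t, frame_next)
# Swapping two slots breaks the assumed correspondence and should raise the loss.
swapped = continuity_loss(frame_t, [frame_next[1], frame_next[0], frame_next[2]])
```

In this toy setting the loss is smaller when encodings of the same object stay close across frames than when two slots are swapped, which is the behaviour a continuity objective is meant to encourage.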