Object-centric learning (OCL) aspires to a general and compositional understanding of scenes by representing each scene as a collection of object-centric representations. OCL has also been extended to multi-view image and video datasets, where geometric or temporal information across multiple images supplies data-driven inductive biases. Single-view images carry less information about how to disentangle a given scene than videos or multi-view images do. Hence, because such inductive biases are difficult to apply, OCL for single-view images remains challenging and tends to learn object-centric representations inconsistently. To this end, we introduce a novel OCL framework for single-view images, SLot Attention via SHepherding (SLASH), which consists of two simple-yet-effective modules on top of Slot Attention. The new modules, Attention Refining Kernel (ARK) and Intermediate Point Predictor and Encoder (IPPE), respectively prevent slots from being distracted by background noise and indicate locations for slots to focus on, facilitating the learning of object-centric representations. We also propose a weak semi-supervision approach for OCL, while the proposed framework can be used without any auxiliary annotation at inference time. Experiments show that the proposed method enables consistent learning of object-centric representations and achieves strong performance across four datasets. Code is available at https://github.com/object-understanding/SLASH.