Slot-based object-centric representations have advanced efficient, flexible, and interpretable abstraction from low-level perceptual features in compositional scenes. Current approaches initialize the slots randomly and then refine them iteratively. As we show in this paper, random slot initialization significantly affects the accuracy of the final slot prediction. Moreover, current approaches require the number of slots to be fixed in advance from prior knowledge of the data, which limits their applicability in the real world. In our work, we instead initialize the slot representations with clustering algorithms conditioned on the perceptual input features. This requires an additional layer in the architecture that initializes the slots from the identified clusters. We design permutation-invariant and permutation-equivariant versions of this layer so that the slot representations remain exchangeable after clustering. Additionally, we employ mean-shift clustering to automatically determine the number of slots for a given scene. We evaluate our method on object discovery and novel view synthesis tasks across various datasets. The results show that our method consistently outperforms prior work, especially on complex scenes.
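To make the idea of cluster-based slot initialization concrete, the following is a minimal sketch and not the method as implemented in the paper. It assumes a flat-kernel mean shift over toy 2-D feature vectors, a hand-chosen `bandwidth`, and plain mean pooling as a stand-in for the learned permutation-invariant layer. The function names `mean_shift` and `init_slots` are hypothetical and introduced only for illustration.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(vectors):
    """Component-wise mean; the result does not depend on the order of the vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def mean_shift(points, bandwidth, n_iter=50, tol=1e-4):
    """Flat-kernel mean shift (illustrative). Returns (modes, labels)."""
    shifted = []
    for p in points:
        current = tuple(p)
        for _ in range(n_iter):
            # Neighbours are all input points within one bandwidth of the current estimate.
            neighbours = [q for q in points if _dist(current, q) <= bandwidth]
            if not neighbours:
                break
            new = _mean(neighbours)
            moved = _dist(new, current)
            current = new
            if moved < tol:
                break
        shifted.append(current)

    # Merge converged estimates that fall within one bandwidth of an existing mode.
    modes, labels = [], []
    for s in shifted:
        for k, m in enumerate(modes):
            if _dist(s, m) <= bandwidth:
                labels.append(k)
                break
        else:
            modes.append(s)
            labels.append(len(modes) - 1)
    return modes, labels


def init_slots(features, bandwidth):
    """One initial slot per discovered cluster (assumption: mean pooling, not a learned layer)."""
    modes, labels = mean_shift(features, bandwidth)
    slots = []
    for k in range(len(modes)):
        members = [f for f, label in zip(features, labels) if label == k]
        slots.append(_mean(members))
    return slots


# Toy "perceptual features": two well-separated groups, so two slots are expected.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
slots = init_slots(features, bandwidth=1.0)
print(len(slots))  # 2
```

In this sketch the number of slots is never passed in; it falls out of how many modes the clustering finds, which is the property the abstract attributes to mean-shift clustering. Because each slot is a mean over its cluster members, reordering the input features changes at most the order of the slots, not their values.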