Developing deep learning models that learn object-centric representations as effectively as human cognition does remains a challenging task. Existing approaches facilitate object discovery by representing objects as fixed-size vectors, called ``slots'' or ``object files''. While these approaches have shown promise in some scenarios, they have two limitations. First, they rely on architectural priors, which can be unreliable and usually require meticulous engineering to identify the correct objects. Second, the practical utility of these representations in downstream tasks has received little investigation. To address the first limitation, we introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot. We formalize this constraint through consistency objectives that are cyclic in nature. Integrating these objectives into several existing slot-based object-centric methods yields substantial improvements in object-discovery performance. These improvements hold across both synthetic and real-world scenes, underscoring the effectiveness and adaptability of the proposed approach. To address the second limitation, we apply the object-centric representations learned by the proposed method to two downstream reinforcement learning tasks, where they considerably outperform conventional slot-based and monolithic representation learning methods. Our results suggest that the proposed approach not only improves object discovery but also provides richer features for downstream tasks.