Learning compositional representations is a key aspect of object-centric learning, as it enables flexible systematic generalization and supports complex visual reasoning. However, most existing approaches rely on an auto-encoding objective, while compositionality is only implicitly imposed by architectural or algorithmic biases in the encoder. This misalignment between the auto-encoding objective and the goal of learning compositionality often results in a failure to capture meaningful object representations. In this study, we propose a novel objective that explicitly encourages compositionality of the representations. Built upon an existing object-centric learning framework (e.g., slot attention), our method adds the constraint that an arbitrary mixture of object representations from two images should be valid, which it enforces by maximizing the likelihood of the composite data. We demonstrate that incorporating our objective into the existing framework consistently improves object-centric learning and enhances robustness to architectural choices.
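The mixing constraint described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mix_slots` and `composite_loss`, the uniform per-slot choice between the two images, and the `decode` and `log_likelihood` callables are all assumptions introduced here for illustration. The abstract only states that object representations from two images are mixed and that the likelihood of the composite is maximized.

```python
import random


def mix_slots(slots_a, slots_b, rng):
    """Build a composite slot set from two images.

    slots_a, slots_b: lists of slot vectors (lists of floats), one list per image.
    Each position is filled from image A or image B with equal probability.
    The uniform choice is an assumption; the source only says "arbitrary mixture".
    """
    if len(slots_a) != len(slots_b):
        raise ValueError("both images must have the same number of slots")
    # True keeps the slot from image A, False takes the slot from image B.
    take_from_a = [rng.random() < 0.5 for _ in slots_a]
    mixed = [a if keep else b for a, b, keep in zip(slots_a, slots_b, take_from_a)]
    return mixed, take_from_a


def composite_loss(mixed_slots, decode, log_likelihood):
    """Negative log-likelihood of the decoded composite, to be minimized.

    decode and log_likelihood are hypothetical callables standing in for
    whatever decoder and likelihood model the framework provides.
    """
    return -log_likelihood(decode(mixed_slots))


if __name__ == "__main__":
    rng = random.Random(0)
    slots_a = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    slots_b = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
    mixed, mask = mix_slots(slots_a, slots_b, rng)

    # Toy stand-ins only: sum the slots to "decode", score with a squared norm.
    toy_decode = lambda slots: [sum(col) for col in zip(*slots)]
    toy_log_likelihood = lambda x: -sum(v * v for v in x)
    print(mask, composite_loss(mixed, toy_decode, toy_log_likelihood))
```

In a real training loop this term would presumably be added to the usual auto-encoding loss, with gradients flowing back into the encoder that produced the slots; how the two terms are weighted is not specified in the abstract.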