Learning structured representations of the visual world in terms of objects promises to significantly improve the generalization abilities of current machine learning models. While recent efforts to this end have shown promising empirical progress, a theoretical account of when unsupervised object-centric representation learning is possible is still lacking. Consequently, understanding the reasons for the success of existing object-centric methods as well as designing new theoretically grounded methods remains challenging. In the present work, we analyze when object-centric representations can provably be learned without supervision. To this end, we first introduce two assumptions on the generative process for scenes comprised of several objects, which we call compositionality and irreducibility. Under this generative process, we prove that the ground-truth object representations can be identified by an invertible and compositional inference model, even in the presence of dependencies between objects. We empirically validate our results through experiments on synthetic data. Finally, we provide evidence that our theory holds predictive power for existing object-centric models by showing a close correspondence between models' compositionality and invertibility and their empirical identifiability.