Causal representation learning has shown a variety of settings in which latent variables can be disentangled with identifiability guarantees (up to some reasonable equivalence class). Common to all of these approaches are the assumptions that (1) the latent variables are represented as $d$-dimensional vectors, and (2) the observations are the output of some injective generative function of these latent variables. While these assumptions appear benign, we show that when the observations are of multiple objects, the generative function is no longer injective and disentanglement fails in practice. We address this failure by combining recent developments in object-centric learning and causal representation learning. By modifying the Slot Attention architecture (arXiv:2006.15055), we develop an object-centric architecture that leverages weak supervision from sparse perturbations to disentangle each object's properties. This approach is more data-efficient in the sense that it requires significantly fewer perturbations than a comparable approach that encodes to a Euclidean space. In a series of simple image-based disentanglement experiments, we show that it successfully disentangles the properties of a set of objects.
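The loss of injectivity with multiple objects can be illustrated with a minimal sketch. It assumes a toy renderer that draws one Gaussian blob per object from a flat latent vector; the function `render`, the `(x, y, brightness)` latent layout, and the blob model are hypothetical choices made for illustration and are not taken from the paper. Swapping the order of the objects in the latent vector changes the vector but not the image, so two distinct latents map to the same observation.

```python
import numpy as np

def render(z, num_objects, size=16):
    """Toy generative function (illustrative assumption, not the paper's model).

    z is a flat vector stacking per-object latents (x, y, brightness).
    Each object contributes one Gaussian blob to the image.
    """
    objs = z.reshape(num_objects, 3)
    ys, xs = np.mgrid[0:size, 0:size]
    image = np.zeros((size, size))
    for x, y, b in objs:
        image += b * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / 4.0)
    return image

# Two objects stacked into one 6-dimensional latent vector.
z = np.array([3.0, 4.0, 1.0, 10.0, 12.0, 0.5])

# The same two objects listed in the opposite order.
z_swapped = np.concatenate([z[3:], z[:3]])

# Distinct latent vectors, identical observations: render is not injective.
same_image = np.allclose(render(z, 2), render(z_swapped, 2))
different_latents = not np.array_equal(z, z_swapped)
print(same_image, different_latents)
```

Under these assumptions, any encoder that maps the image back to a single fixed-order vector has to pick one ordering arbitrarily, which is the ambiguity a set-structured (slot-based) representation is meant to avoid.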