Causal representation learning has emerged as a central focus of causal machine learning research. In particular, multi-domain datasets present a natural opportunity to showcase the advantages of causal representation learning over standard unsupervised representation learning. While recent works have taken crucial steps towards learning causal representations, they often lack applicability to multi-domain datasets because of oversimplifying assumptions about the data, e.g., that each domain comes from a different single-node perfect intervention. In this work, we relax these assumptions and capitalize on the following observation: there often exists a subset of latents for which certain distributional properties (e.g., support, variance) remain stable across domains; this property holds when, for example, each domain comes from a multi-node imperfect intervention. Leveraging this observation, we show that autoencoders that incorporate such invariances can provably separate the stable set of latents from the rest across different settings.
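To make the idea of an invariance over a subset of latents concrete, the following is a minimal sketch of one possible penalty, assuming the stable property is the variance. It is not the objective used in this work: the function name `variance_invariance_penalty`, the choice of `stable_dims`, and the synthetic two-domain data are all hypothetical and chosen only for illustration. In practice such a term would be added to an autoencoder's reconstruction loss and computed on the encoder outputs of each domain.

```python
import numpy as np


def variance_invariance_penalty(latents_by_domain, stable_dims):
    """Hypothetical penalty: spread of per-domain variances on chosen latents.

    latents_by_domain: list of arrays, each of shape (n_samples, n_latents),
        one array per domain (assumed to be encoder outputs).
    stable_dims: list of latent indices assumed to have stable variance.
    """
    # Per-domain variance of each selected latent: shape (n_domains, n_stable).
    per_domain_var = np.stack(
        [z[:, stable_dims].var(axis=0) for z in latents_by_domain]
    )
    # Zero exactly when every domain has the same variance on each selected latent.
    return float(per_domain_var.var(axis=0).sum())


# Synthetic illustration (assumption): latent 0 keeps its scale across domains,
# while latent 1 is rescaled in the second domain.
rng = np.random.default_rng(0)
domain_a = rng.normal(size=(5000, 2)) * np.array([1.0, 1.0])
domain_b = rng.normal(size=(5000, 2)) * np.array([1.0, 3.0])

penalty_stable = variance_invariance_penalty([domain_a, domain_b], [0])
penalty_unstable = variance_invariance_penalty([domain_a, domain_b], [1])
print(penalty_stable, penalty_unstable)
```

On this synthetic data the penalty on latent 0 is close to zero and the penalty on latent 1 is large, which is the signal an invariance-regularized autoencoder could use to tell the two groups of latents apart. A support-based invariance would need a different statistic (for example, per-domain minima and maxima) in place of the variance.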