Causal inference from observational data plays a critical role in many applications in trustworthy machine learning. While sound and complete algorithms exist to compute causal effects, many of them assume access to conditional likelihoods, which are difficult to estimate for high-dimensional (particularly image) data. Researchers have alleviated this issue by simulating causal relations with neural models. However, when the causal graph contains high-dimensional variables along with some unobserved confounders, no existing work can effectively sample from the unconditional or conditional interventional distributions. In this work, we show how to sample from any identifiable interventional distribution given an arbitrary causal graph through a sequence of push-forward computations of conditional generative models, such as diffusion models. Our proposed algorithm follows the recursive steps of existing likelihood-based identification algorithms to train a set of feed-forward models and connects them in a specific way to sample from the desired distribution. We conduct experiments on a Colored MNIST dataset in which both the treatment ($X$) and the target ($Y$) are images, and sample from $P(y|do(x))$. Our algorithm also enables a causal analysis that evaluates spurious correlations among the input features of generative models pre-trained on the CelebA dataset. Finally, we generate high-dimensional interventional samples from the MIMIC-CXR dataset, which involves both text and image variables.
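To make the idea of sampling by chaining conditional generators concrete, the following is a minimal toy sketch and not the algorithm proposed in this work. It assumes a front-door graph $X \to Z \to Y$ with a hidden confounder $U$ of $X$ and $Y$, for which $P(y|do(x)) = \sum_z P(z|x) \sum_{x'} P(y|x',z)P(x')$. Linear-Gaussian regressors fitted by least squares stand in for the conditional generative models (a diffusion model would take their place for images). All names, coefficients, and the helper `fit_conditional_sampler` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observational data from a front-door graph X -> Z -> Y with a hidden
# confounder U of X and Y. U is simulated here but never shown to the models.
n = 50_000
u = rng.normal(size=n)
x = u + rng.normal(size=n)
z = 2.0 * x + rng.normal(size=n)
y = 1.5 * z + 3.0 * u + rng.normal(size=n)


def fit_conditional_sampler(inputs, target):
    """Fit a linear-Gaussian stand-in for a conditional generative model."""
    design = np.column_stack([inputs, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    noise_std = np.std(target - design @ coef)

    def sample(cond):
        # Push the conditioning values forward and add fitted Gaussian noise.
        cond = np.column_stack([cond, np.ones(len(cond))])
        return cond @ coef + noise_std * rng.normal(size=len(cond))

    return sample


# Two "trained" conditional samplers, using observed variables only.
sample_z_given_x = fit_conditional_sampler(x[:, None], z)
sample_y_given_xz = fit_conditional_sampler(np.column_stack([x, z]), y)


def sample_y_do_x(x_value, num_samples):
    """Front-door sampling: z ~ P(z|x), x' ~ P(x), y ~ P(y|x', z)."""
    x_do = np.full((num_samples, 1), x_value)
    z_s = sample_z_given_x(x_do)
    x_prime = rng.choice(x, size=num_samples)  # draw from the marginal P(x)
    return sample_y_given_xz(np.column_stack([x_prime, z_s]))


samples = sample_y_do_x(1.0, 20_000)
print(samples.mean())  # expected to be near 2.0 * 1.5 * 1.0 = 3.0
```

In this toy model the interventional mean at $x = 1$ is $2.0 \times 1.5 = 3.0$, whereas the observational conditional mean $E[Y \mid X = 1]$ is $4.5$ because of the hidden confounder. The chained samplers target the former without ever evaluating a likelihood.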