Causal discovery in the presence of missing data introduces a chicken-and-egg dilemma. While the goal is to recover the true causal structure, robust imputation requires considering the dependencies, or preferably the causal relations, among variables. Merely filling in missing values with existing imputation methods and subsequently applying structure learning to the completed data is empirically shown to be sub-optimal. To this end, we propose a score-based algorithm, built on optimal transport, for learning causal structure from missing data. This optimal transport viewpoint diverges from existing score-based approaches, which are predominantly based on expectation-maximization (EM). We cast structure learning as a density fitting problem, where the goal is to find the causal model whose induced distribution has minimum Wasserstein distance to the distribution over the observed data. Through extensive simulations and real-data experiments, our framework is shown to recover the true causal graphs more effectively than the baselines. Empirical evidence also demonstrates the superior scalability of our approach, along with the flexibility to incorporate any off-the-shelf causal discovery method for complete data.
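To make the density-fitting view concrete, the following is a minimal toy sketch and not the algorithm proposed in the paper. It assumes a two-variable linear Gaussian model, values missing completely at random, and a complete-case simplification in which only fully observed rows are compared against model samples. The helper names (`empirical_w2_squared`, `simulate`) and all constants are hypothetical choices for illustration. Two candidate structures, an edge X -> Y and the empty graph, are each fitted and then scored by the exact squared 2-Wasserstein distance between their samples and the observed rows.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def empirical_w2_squared(a, b):
    """Exact squared 2-Wasserstein distance between two equal-size point clouds."""
    cost = cdist(a, b, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return cost[rows, cols].mean()


def simulate(n, coef, noise_std, rng):
    """Sample from the linear Gaussian model X -> Y, Y = coef * X + noise."""
    x = rng.normal(size=n)
    y = coef * x + noise_std * rng.normal(size=n)
    return np.column_stack([x, y])


rng = np.random.default_rng(0)

# Ground truth (assumed for this toy): X -> Y with coefficient 2.
data = simulate(600, coef=2.0, noise_std=0.5, rng=rng)

# Hide roughly 30% of the Y values completely at random.
missing = rng.random(len(data)) < 0.3
data[missing, 1] = np.nan

# Complete-case simplification: keep only rows where Y is observed.
observed = data[~np.isnan(data[:, 1])]
n = len(observed)
x_obs, y_obs = observed[:, 0], observed[:, 1]

# Candidate 1: graph X -> Y, parameters fitted by least squares.
slope, intercept = np.polyfit(x_obs, y_obs, 1)
resid_std = np.std(y_obs - (slope * x_obs + intercept))
x_gen = x_obs.mean() + x_obs.std() * rng.normal(size=n)
y_gen = intercept + slope * x_gen + resid_std * rng.normal(size=n)
samples_edge = np.column_stack([x_gen, y_gen])

# Candidate 2: empty graph, X and Y sampled independently from fitted marginals.
samples_empty = np.column_stack([
    x_obs.mean() + x_obs.std() * rng.normal(size=n),
    y_obs.mean() + y_obs.std() * rng.normal(size=n),
])

score_edge = empirical_w2_squared(samples_edge, observed)
score_empty = empirical_w2_squared(samples_empty, observed)

print(f"W2^2 with edge X -> Y: {score_edge:.3f}")
print(f"W2^2 with empty graph: {score_empty:.3f}")
```

In this toy setting the candidate containing the edge should obtain the smaller distance, since the empty graph cannot reproduce the dependence between X and Y. Note that the distance alone does not orient the edge here (a linear Gaussian model with Y -> X fits equally well), and that discarding incomplete rows sidesteps the harder problem the abstract addresses, namely fitting the model using partially observed samples.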