Causal discovery aims to infer causal relationships among variables from observational data, typically represented by a directed acyclic graph (DAG). Most existing methods assume independent and identically distributed observations, an assumption often violated in modern applications. In addition, many datasets contain a mixture of continuous and discrete variables, which further complicates causal modeling when dependence across samples is present. To address these challenges, we propose a de-correlation framework for causal discovery from dependent mixed data. Our approach formulates a structural equation model with latent variables that accommodates both continuous and discrete variables while allowing correlated Gaussian errors across units. We estimate the dependence structure among samples via a pairwise maximum likelihood estimator for the covariance matrix and develop an EM algorithm to impute latent variables underlying discrete observations. A de-correlation transformation of the recovered latent data enables the use of standard causal discovery algorithms to learn the underlying causal graph. Simulation studies demonstrate that the proposed method substantially improves causal graph recovery compared with applying standard methods directly to the original dependent data. We apply our approach to single-cell RNA sequencing data to infer gene regulatory networks governing embryonic stem cell differentiation. The inferred regulatory networks show significantly improved predictive likelihood on test data, and many high-confidence edges are supported by known regulatory interactions reported in the literature.
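The de-correlation step can be illustrated with a minimal sketch. It assumes, for illustration only, that the covariance among the n units is a known positive-definite matrix `Sigma` and that the transformation is whitening by the inverse Cholesky factor. In the proposed method this covariance is estimated by pairwise maximum likelihood and the latent data are imputed by EM, both of which are omitted here. The function name `decorrelate` and the block-diagonal `Sigma` are hypothetical choices, not taken from the paper.

```python
import numpy as np

def decorrelate(X, Sigma):
    """Whiten the rows of X given an (n x n) covariance Sigma among samples.

    Assumption: Sigma is known and positive definite. With Sigma = L L^T,
    the transformed data L^{-1} X has uncorrelated rows.
    """
    L = np.linalg.cholesky(Sigma)   # lower-triangular factor, Sigma = L @ L.T
    return np.linalg.solve(L, X)    # computes L^{-1} X without forming the inverse

rng = np.random.default_rng(0)
n, p = 200, 3

# Hypothetical dependence structure: units fall into groups of 4,
# with unit variance and within-group correlation 0.7.
block = 0.7 * np.ones((4, 4)) + 0.3 * np.eye(4)
Sigma = np.kron(np.eye(n // 4), block)

# Simulate errors whose rows (units) are correlated according to Sigma.
Z = rng.standard_normal((n, p))
E = np.linalg.cholesky(Sigma) @ Z

# After de-correlation the rows are independent again.
E_tilde = decorrelate(E, Sigma)
```

The transformed matrix `E_tilde` could then be passed to any standard causal discovery algorithm that assumes independent samples, which is the role the de-correlation transformation plays in the framework described above.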