The discovery of causal relationships from high-dimensional data is a major open problem in bioinformatics. Machine learning and feature attribution models have shown great promise in this context but lack causal interpretation. Here, we show that a popular feature attribution model estimates a causal quantity reflecting the influence of one variable on another, under certain assumptions. We leverage this insight to implement a new tool, CIMLA, for discovering condition-dependent changes in causal relationships. We then use CIMLA to identify differences in gene regulatory networks between biological conditions, a problem that has received great attention in recent years. Using extensive benchmarking on simulated data sets, we show that CIMLA is more robust to confounding variables and is more accurate than leading methods. Finally, we employ CIMLA to analyze a previously published single-cell RNA-seq data set collected from subjects with and without Alzheimer's disease (AD), discovering several potential regulators of AD.