Identifying variables responsible for changes to a biological system enables applications in drug target discovery and cell engineering. Given a pair of observational and interventional datasets, the goal is to isolate the subset of observed variables that were the targets of the intervention. Directly applying causal discovery algorithms is challenging: the data may contain thousands of variables with as few as tens of samples per intervention, and biological systems do not adhere to classical causality assumptions. We propose a causality-inspired approach to address this practical setting. First, we infer noisy causal graphs from the observational and interventional data. Then, we learn to map the differences between these graphs, along with additional statistical features, to sets of variables that were intervened upon. Both modules are jointly trained in a supervised framework, on simulated and real data that reflect the nature of biological interventions. This approach consistently outperforms baselines for perturbation modeling on seven single-cell transcriptomics datasets. We also demonstrate significant improvements over current causal discovery methods for predicting soft and hard intervention targets across a variety of synthetic data.
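The two-stage idea (infer a noisy graph from each dataset, then map graph differences plus marginal statistics to intervention targets) can be illustrated with a toy sketch. This is not the proposed method: the learned, jointly trained modules are replaced here by fixed heuristics. Absolute correlation stands in for the inferred causal graph, and a sum of z-scored features stands in for the learned mapping. The function names (`simulate_chain`, `noisy_graph`, `target_scores`), the linear chain simulator, and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def simulate_chain(n, rng, target=None, shift=3.0, d=5):
    """Toy linear chain x0 -> x1 -> ... -> x(d-1) with unit weights and unit noise.

    If `target` is given, that variable receives a hard intervention:
    its parent is cut and it is drawn from N(shift, 1).
    (Hypothetical simulator, not taken from the paper.)
    """
    x = np.zeros((n, d))
    for j in range(d):
        noise = rng.normal(size=n)
        if j == target:
            x[:, j] = shift + noise          # hard intervention: ignore parent
        elif j == 0:
            x[:, j] = noise                  # root variable
        else:
            x[:, j] = x[:, j - 1] + noise    # child of the previous variable
    return x


def noisy_graph(x):
    """Crude stand-in for an inferred causal graph: |correlation|, zero diagonal."""
    c = np.nan_to_num(np.corrcoef(x, rowvar=False))
    np.fill_diagonal(c, 0.0)
    return np.abs(c)


def zscore(v):
    """Standardize a feature vector across variables."""
    return (v - v.mean()) / (v.std() + 1e-8)


def target_scores(x_obs, x_int):
    """Score each variable as a candidate intervention target (heuristic)."""
    # Feature 1: total change in edge strength touching each variable.
    graph_diff = np.abs(noisy_graph(x_int) - noisy_graph(x_obs)).sum(axis=0)
    # Feature 2: standardized shift in each variable's mean.
    pooled_std = np.sqrt(0.5 * (x_obs.var(axis=0) + x_int.var(axis=0))) + 1e-8
    mean_shift = np.abs(x_int.mean(axis=0) - x_obs.mean(axis=0)) / pooled_std
    # Fixed combination; the paper instead learns this mapping from supervision.
    return zscore(graph_diff) + zscore(mean_shift)


rng = np.random.default_rng(0)
x_obs = simulate_chain(1000, rng)            # observational data
x_int = simulate_chain(200, rng, target=2)   # interventional data, target = x2
scores = target_scores(x_obs, x_int)
ranking = np.argsort(-scores)                # variables ordered by descending score
```

In this toy chain the intervened variable would be expected to rank first, but neither feature isolates it alone: the target's parent shows the largest change in edge strength, and every downstream variable also shifts in mean. That ambiguity is one motivation for learning the mapping from graph differences and statistical features rather than fixing it by hand.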