Emerging single-cell technologies that integrate CRISPR-based genetic perturbations with single-cell RNA sequencing, such as Perturb-seq, have substantially advanced our understanding of gene regulation and the causal influence of genes. While Perturb-seq data provide valuable causal insights into gene-gene interactions, statistical concerns remain regarding unobserved confounders that may bias inference. These latent factors may arise not only from intrinsic molecular features of the regulatory elements encoded in Perturb-seq experiments, but also from genes left unobserved under cost-constrained experimental designs. Although methods for analyzing large-scale Perturb-seq data are rapidly maturing, approaches that explicitly account for such unobserved confounders when learning causal gene networks are still lacking. Here, we propose a novel method to recover causal gene networks from Perturb-seq experiments that is robust to arbitrarily omitted confounders. Our framework leverages proxy and instrumental variable strategies to exploit the rich information embedded in perturbations, enabling unbiased estimation of the underlying directed acyclic graph (DAG) of gene expression. Simulation studies and analyses of CRISPR interference experiments in K562 cells demonstrate that our method outperforms baseline approaches that ignore unmeasured confounding, yielding more accurate and biologically relevant recovery of the true causal gene DAGs.