Emerging single-cell technologies that combine CRISPR-based genetic perturbations with single-cell RNA sequencing, such as Perturb-seq, offer unprecedented opportunities to uncover cause-and-effect relationships among genes. Nonetheless, Perturb-seq experiments are subject to unobserved factors that, if not properly handled, can severely bias the inferred causal relationships between genes. These latent factors may arise not only from intrinsic molecular features of the regulatory elements, but also from unmeasured genes omitted due to cost-constrained experimental designs. Although methods for analyzing large-scale Perturb-seq data are rapidly maturing, approaches that explicitly account for such unobserved confounders when inferring causal gene networks are still lacking. Here, we propose a novel approach to accurately reconstruct causal gene networks from Perturb-seq data even when important confounders are missing. Our framework leverages proxy and instrumental variable strategies to exploit the rich information embedded in the perturbations, enabling unbiased estimation of the underlying directed acyclic graph (DAG) of gene expression. Applications to both comprehensive synthetic data and real CRISPR interference experiments in K562 cells demonstrate that our method outperforms baseline approaches that lack principled adjustments for unmeasured confounding, yielding more accurate and biologically relevant recovery of the true causal DAGs.