When Does Gene Regulatory Network Inference Break? A Controlled Diagnostic Study of Causal and Correlational Methods on Single-Cell Data

Despite their theoretical advantages, causal methods for gene regulatory network (GRN) inference from single-cell RNA-seq data fail to match or outperform correlation-based baselines on many realistic benchmarks, a persistent puzzle that casts doubt on the value of causality for this task. We argue that existing benchmarks are insufficiently controlled to resolve this puzzle: they evaluate on real or semi-real data in which multiple pathologies co-occur, which confounds failure modes and obscures the specific conditions under which each inference method excels or fails. To address this gap, we introduce a controlled diagnostic framework that isolates seven biologically motivated pathologies (dropout, latent confounders, cell-type mixing, feedback loops, network density, sample size, and pseudotime drift), and we measure how six representative methods spanning three inference paradigms degrade as each pathology intensifies. Across 6,120 controlled experiments, we find that causal methods genuinely dominate in clean and structurally favorable regimes, but that specific pathologies (notably dropout and latent confounders) selectively neutralize their advantages. We further introduce an error-type decomposition, which reveals that methods with similar aggregate accuracy commit qualitatively different errors. To probe whether single-pathology effects persist when multiple stressors co-occur, we perform an interaction sweep over the three most impactful pathologies. Their joint effects are sub-additive, and the sweep exposes density-conditional crossovers that are invisible to single-dial analysis. Together, these findings clarify when and why different methods succeed or fail at GRN inference, offering actionable insights for method development and practical guidance for practitioners.
