Controlling false positives (Type I errors) through statistical hypothesis testing is a foundation of modern scientific data analysis. Existing causal structure discovery algorithms either do not provide Type I error control or cannot scale to the size of modern scientific datasets. We consider a variant of the causal discovery problem with two sets of nodes, where the only edges of interest form a bipartite causal subgraph between the sets. We develop Scalable Causal Structure Learning (SCSL), a method for causal structure discovery on bipartite subgraphs that provides Type I error control. SCSL recasts the discovery problem as a simultaneous hypothesis testing problem and uses discrete optimization over the set of possible confounders to obtain an upper bound on the test statistic for each edge. Semi-synthetic simulations demonstrate that SCSL scales to handle graphs with hundreds of nodes while maintaining error control and good power. We demonstrate the practical applicability of the method by applying it to a cancer dataset to reveal connections between somatic gene mutations and metastases to different tissues.
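As a rough illustration of the recast into simultaneous hypothesis testing, the following minimal Python sketch assigns each candidate edge a conservative p-value, taken as the largest p-value over candidate conditioning sets, and then applies a multiplicity correction. It is not the SCSL implementation. Brute-force enumeration stands in for the discrete optimization, Holm's step-down procedure is an assumed choice of correction, and the names `conservative_edge_pvalue`, `holm_reject`, and `ci_test`, along with the toy p-values, are hypothetical.

```python
from itertools import combinations


def conservative_edge_pvalue(ci_test, x, y, candidates, max_size):
    """Largest p-value over all conditioning sets of size <= max_size.

    ci_test(x, y, subset) is a user-supplied (hypothetical) conditional
    independence test returning a p-value; subset is a tuple of names.
    """
    worst = 0.0
    for k in range(max_size + 1):
        for subset in combinations(candidates, k):
            worst = max(worst, ci_test(x, y, subset))
    return worst


def holm_reject(pvalues, alpha):
    """Return the keys rejected by Holm's step-down procedure at level alpha."""
    ordered = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(ordered)
    rejected = set()
    for i, (key, p) in enumerate(ordered):
        # Compare the i-th smallest p-value with alpha / (m - i); stop at the first failure.
        if p > alpha / (m - i):
            break
        rejected.add(key)
    return rejected


# Toy p-values (made up for illustration), indexed by (source, target, conditioning set).
toy_pvalues = {
    ("g1", "t1", ()): 0.001,
    ("g1", "t1", ("c",)): 0.004,
    ("g2", "t1", ()): 0.002,
    ("g2", "t1", ("c",)): 0.60,
}


def toy_test(x, y, subset):
    return toy_pvalues[(x, y, subset)]


edges = [("g1", "t1"), ("g2", "t1")]
edge_pvalues = {
    (x, y): conservative_edge_pvalue(toy_test, x, y, ["c"], max_size=1)
    for x, y in edges
}
discoveries = holm_reject(edge_pvalues, alpha=0.05)
```

In this toy setup the edge from `g2` to `t1` looks significant marginally but not after conditioning on `c`, so only the edge from `g1` to `t1` survives. Taking the maximum is conservative provided at least one of the enumerated sets is a valid adjustment set, since the maximum can never fall below the p-value from that set.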