The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause-and-effect relationships in a general way, without necessarily tailoring the method to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of directed acyclic graphs, to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable, and scalable algorithms for causal discovery are needed to bridge the gap. In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees. We leverage the idea of a superstructure (a set of learned or existing candidate hypotheses) to partition the search space. We prove that, under certain assumptions, learning with a causal graph partition always yields the Markov equivalence class of the true causal graph. We show that our algorithm achieves comparable accuracy and a faster time to solution on biologically tuned synthetic networks and on networks with up to $10^4$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces.
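To make the divide-and-conquer idea concrete, the following is a minimal sketch, not the construction defined in the paper. It assumes the superstructure is given as an undirected graph and that a disjoint partition of the variables is already available; it then grows each part by its superstructure neighbors so that the resulting subsets overlap. The names `expand_partition` and `candidate_edges`, the toy graph, and the expansion rule itself are illustrative assumptions. A local causal discovery routine would then be run on each overlapping subset and the outputs merged; that step is omitted here.

```python
from typing import Dict, List, Set, Tuple

# Assumption: the superstructure is an undirected graph stored as adjacency sets.
Graph = Dict[str, Set[str]]


def expand_partition(superstructure: Graph, parts: List[Set[str]]) -> List[Set[str]]:
    """Grow each disjoint part by its neighbors in the superstructure (hypothetical rule)."""
    expanded = []
    for part in parts:
        boundary: Set[str] = set()
        for node in part:
            boundary |= superstructure[node]
        expanded.append(part | boundary)
    return expanded


def candidate_edges(superstructure: Graph, subset: Set[str]) -> Set[Tuple[str, str]]:
    """Return superstructure edges whose endpoints both lie inside the subset."""
    return {
        (u, v)
        for u in subset
        for v in superstructure[u]
        if v in subset and u < v  # keep one orientation per undirected edge
    }


# Toy superstructure: a chain a - b - c - d - e.
superstructure: Graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
disjoint_parts = [{"a", "b"}, {"c", "d", "e"}]

overlapping = expand_partition(superstructure, disjoint_parts)
all_edges = candidate_edges(superstructure, set(superstructure))
covered = set().union(*(candidate_edges(superstructure, s) for s in overlapping))

print(overlapping)
print(covered == all_edges)
```

Under this expansion rule, every candidate edge of the superstructure falls inside at least one subset, because the part containing one endpoint is grown to include the other. Each local sub-problem therefore searches a smaller hypothesis space without dropping any candidate edge before the merge.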