The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause-and-effect relationships in a general way, without necessarily tailoring the method to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of directed acyclic graphs, to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable, and scalable algorithms for causal discovery are needed to bridge the gap. In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees. We leverage the idea of a superstructure, a set of learned or existing candidate hypotheses, to partition the search space. We prove that, under certain assumptions, learning with a causal graph partition always yields the Markov equivalence class of the true causal graph. We show that our algorithm achieves comparable accuracy and a faster time to solution on biologically tuned synthetic networks and on networks with up to $10^4$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces.
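To give a sense of why the search over directed acyclic graphs becomes intractable, the sketch below counts labeled DAGs on n nodes using Robinson's recurrence. It is an illustration only and not part of the method described above: the function name `count_dags` and the equal-split comparison at the end are assumptions introduced here, and the comparison ignores overlaps between subsets and any pruning a superstructure would provide.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes (Robinson's recurrence).

    a(0) = 1
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
    return total


if __name__ == "__main__":
    # The unrestricted hypothesis space grows super-exponentially with n.
    for n in range(1, 8):
        print(n, count_dags(n))

    # Hypothetical comparison: one search over 10 variables versus two
    # independent searches over 5 variables each (a simplification that
    # ignores shared variables between subsets).
    joint = count_dags(10)
    split = 2 * count_dags(5)
    print("joint search space:", joint)
    print("split search space:", split)
```

Even under this crude simplification, searching two subsets of five variables covers far fewer candidate graphs than a single search over ten, which is the intuition behind a divide-and-conquer approach.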