Several causal discovery algorithms have been proposed. However, when the sample size is small relative to the number of variables, the accuracy with which existing methods estimate causal graphs decreases. Moreover, some methods are not feasible when the sample size is smaller than the number of variables. To circumvent these problems, some researchers have proposed causal structure learning algorithms based on divide-and-conquer approaches. To learn the entire causal graph, these approaches first split the variables into several subsets according to the conditional independence relationships among the variables, then apply a conventional causal discovery algorithm to each subset and merge the estimated results. Since the divide-and-conquer approach reduces the number of variables to which a causal structure learning algorithm is applied, it is expected to improve the estimation accuracy of causal graphs, especially when the sample size is small relative to the number of variables and the model is sparse. However, existing methods are either computationally expensive or do not provide sufficient accuracy when the sample size is small. This paper proposes a new algorithm for grouping variables based on the ancestral relationships among the variables under the LiNGAM assumption, in which the causal relationships are linear and the noise terms are mutually independent and follow continuous non-Gaussian distributions. We call the proposed algorithm CAG. The time complexity of the ancestor finding in CAG is shown to be cubic in the number of variables. Extensive computer experiments confirm that the proposed method outperforms both the original DirectLiNGAM without grouping variables and other divide-and-conquer approaches, not only in estimation accuracy but also in computation time, when the sample size is small relative to the number of variables and the model is sparse.
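To make the setting concrete, the following is a minimal sketch, not the CAG algorithm itself. It simulates data from a linear model with independent uniform (hence non-Gaussian) noise, as the LiNGAM assumption requires, and then groups variables by ancestral relationships read off the known coefficient matrix. The function names, the example matrix `B`, and the grouping rule (one group per sink variable together with its ancestors) are assumptions made for illustration only; CAG estimates ancestral relationships from data rather than from a known `B`.

```python
import numpy as np


def simulate_lingam(B, n_samples, rng):
    """Draw samples from x = B x + e with independent uniform (non-Gaussian) noise."""
    p = B.shape[0]
    e = rng.uniform(-1.0, 1.0, size=(n_samples, p))
    # Solving x = B x + e gives x = (I - B)^{-1} e, applied here to every row.
    return e @ np.linalg.inv(np.eye(p) - B).T


def ancestor_matrix(B):
    """A[i, j] is True when x_j is an ancestor of x_i (transitive closure of edges)."""
    A = B != 0
    for k in range(B.shape[0]):
        # Warshall-style update: j reaches i if j reaches k and k reaches i.
        A = A | (A[:, [k]] & A[[k], :])
    return A


def ancestral_groups(A):
    """Illustrative grouping rule (an assumption): each sink plus all its ancestors."""
    is_sink = ~A.any(axis=0)  # a sink is not an ancestor of any variable
    return [
        sorted({i} | set(np.flatnonzero(A[i]).tolist()))
        for i in range(A.shape[0])
        if is_sink[i]
    ]


# Hypothetical sparse model with two chains: x0 -> x1 -> x2 and x3 -> x4.
B = np.zeros((5, 5))
B[1, 0], B[2, 1], B[4, 3] = 0.8, -0.5, 0.7

X = simulate_lingam(B, n_samples=200, rng=np.random.default_rng(0))
groups = ancestral_groups(ancestor_matrix(B))
print(X.shape, groups)
```

In this toy model the two chains share no ancestors, so a conventional algorithm such as DirectLiNGAM could be applied to a three-variable group and a two-variable group separately instead of to all five variables at once, which is the reduction in problem size that the divide-and-conquer approach relies on.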