We study the problem of learning a directed acyclic graph (DAG) when the data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Under Gaussian polytree models, we derive sufficient conditions on the sample size for the well-known Chow-Liu algorithm to exactly recover both the skeleton and the equivalence class of the polytree, which is uniquely represented by a CPDAG. We also derive necessary conditions on the sample size for both skeleton and CPDAG recovery in the form of information-theoretic lower bounds; these match the respective sufficient conditions and thereby give a sharp characterization of the difficulty of these tasks. In addition, we consider inverse correlation matrix estimation under linear polytree models and establish an estimation error bound in terms of the dimension and the total number of v-structures. We further consider an extension to group linear polytree models, in which each node represents a group of variables. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data demonstrate the robustness of polytree learning when the true graphical structure can only be approximated by a polytree.
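As a rough illustration of the skeleton-recovery step, the sketch below applies the Chow-Liu idea to Gaussian data. For jointly Gaussian variables the pairwise mutual information is an increasing function of the absolute correlation, so the Chow-Liu tree can be obtained as a maximum-weight spanning tree on absolute sample correlations. The function names, the five-node polytree, and the edge coefficients are assumptions made for this example and are not taken from the paper; orienting edges to obtain the CPDAG is not covered here.

```python
import numpy as np


def chow_liu_skeleton(X):
    """Maximum-weight spanning tree on |sample correlation| (Kruskal's algorithm)."""
    p = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    # Candidate edges, sorted by decreasing absolute correlation.
    pairs = sorted(
        ((corr[i, j], i, j) for i in range(p) for j in range(i + 1, p)),
        reverse=True,
    )
    parent = list(range(p))  # union-find forest used for cycle detection

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    edges = set()
    for _, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) does not create a cycle
            parent[ri] = rj
            edges.add((i, j))
    return edges


def simulate_polytree(n, seed=0):
    """Draw n samples from a hypothetical linear Gaussian SEM on the
    polytree 0 -> 2 <- 1, 2 -> 3, 2 -> 4 (coefficients chosen arbitrarily)."""
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((n, 5))  # independent standard normal noise
    X = np.empty((n, 5))
    X[:, 0] = E[:, 0]
    X[:, 1] = E[:, 1]
    X[:, 2] = 0.8 * X[:, 0] + 0.8 * X[:, 1] + E[:, 2]
    X[:, 3] = 0.9 * X[:, 2] + E[:, 3]
    X[:, 4] = -0.9 * X[:, 2] + E[:, 4]
    return X


true_skeleton = {(0, 2), (1, 2), (2, 3), (2, 4)}
estimated = chow_liu_skeleton(simulate_polytree(n=5000))
print(estimated == true_skeleton)
```

With a sample this large relative to the gaps between the population correlations, the estimated tree is expected to coincide with the true skeleton; with fewer samples, weakly correlated edges are the first to be misidentified, which is the regime the sample-size conditions address.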