Identifying latent variables and causal structures from observational data is essential to many real-world applications involving biological data, medical data, and unstructured data such as images and languages. However, this task can be highly challenging, especially when observed variables are generated by causally related latent variables and the relationships are nonlinear. In this work, we investigate the identification problem for nonlinear latent hierarchical causal models in which observed variables are generated by a set of causally related latent variables, and some latent variables may not have observed children. We show that the identifiability of both causal structure and latent variables can be achieved under mild assumptions: on causal structures, we allow for the existence of multiple paths between any pair of variables in the graph, which relaxes latent tree assumptions in prior work; on structural functions, we do not make parametric assumptions, thus permitting general nonlinearity and multi-dimensional continuous variables. Specifically, we first develop a basic identification criterion in the form of novel identifiability guarantees for an elementary latent variable model. Leveraging this criterion, we show that both causal structures and latent variables of the hierarchical model can be identified asymptotically by explicitly constructing an estimation procedure. To the best of our knowledge, our work is the first to establish identifiability guarantees for both causal structures and latent variables in nonlinear latent hierarchical models.
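To make the model class concrete, the following is a minimal simulation sketch of a nonlinear latent hierarchical model of the kind described above. Everything in it is an illustrative assumption rather than the authors' construction: the graph (four latent variables `z1` to `z4`, six observed variables), the particular nonlinear functions, the noise scales, and the function name `simulate_latent_hierarchy` are all hypothetical. The toy graph has two directed paths from `z1` to `z4` (so it is not a tree), and `z1` has no observed children. The sketch only generates data; it does not verify the paper's identifiability conditions and does not implement its estimation procedure.

```python
import numpy as np


def simulate_latent_hierarchy(n_samples, seed=0):
    """Draw samples from a toy nonlinear latent hierarchical model.

    Hypothetical graph (not taken from the paper):
        z1 -> z2, z1 -> z3, z2 -> z4, z3 -> z4
        z2 -> x1, x2;  z3 -> x3, x4;  z4 -> x5, x6
    """
    rng = np.random.default_rng(seed)

    def noise(scale):
        # Independent Gaussian noise term for one variable.
        return scale * rng.normal(size=n_samples)

    # Latent layer: z1 is a root with latent children only.
    z1 = noise(1.0)
    z2 = np.tanh(z1) + noise(0.5)
    z3 = z1 ** 2 + noise(0.5)
    # z4 has two latent parents, giving two paths z1 -> z4.
    z4 = np.sin(z2 + z3) + noise(0.5)

    # Observed layer: each observed variable is a nonlinear
    # function of one latent parent plus noise.
    x = np.column_stack([
        np.tanh(2.0 * z2) + noise(0.1),
        z2 ** 3 + noise(0.1),
        np.exp(-z3) + noise(0.1),
        np.sin(z3) + noise(0.1),
        z4 ** 2 + noise(0.1),
        np.tanh(z4) + noise(0.1),
    ])
    z = np.column_stack([z1, z2, z3, z4])
    return z, x


latents, observed = simulate_latent_hierarchy(1000)
print(latents.shape, observed.shape)
```

In the identification problem only `observed` would be available; `latents` and the graph are the quantities to be recovered.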