Identifiability of statistical models is a key notion in unsupervised representation learning. Recent work on nonlinear independent component analysis (ICA) employs auxiliary data and has established identifiability conditions. This paper proposes a statistical model of two latent vectors with single auxiliary data, which generalizes nonlinear ICA, and establishes various identifiability conditions. Unlike previous work, the two latent vectors in the proposed model can have arbitrary dimensions, and this property enables us to reveal an insightful dimensionality relation among the two latent vectors and the auxiliary data in the identifiability conditions. Furthermore, surprisingly, we prove that the indeterminacies of the proposed model are the same as those of \emph{linear} ICA under certain conditions: the elements of the latent vector can be recovered up to permutation and scaling. Next, we apply the identifiability theory to a statistical model for graph data. One of the resulting identifiability conditions has an appealing implication: identifiability of the statistical model can depend on the maximum value of the link weights in the graph data. We then propose a practical method for identifiable graph embedding. Finally, we numerically demonstrate that the proposed method recovers the latent vectors well and that model identifiability clearly depends on the maximum value of the link weights, which supports the implication of our theoretical results.