The Manifold Hypothesis is a widely accepted tenet of Machine Learning which asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold embedded in high-dimensional space. This phenomenon is observed empirically in many real-world situations, has led to the development of a wide range of statistical methods over the last few decades, and has been suggested as a key factor in the success of modern AI technologies. We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model -- the Latent Metric Model -- via elementary concepts such as latent variables, correlation and stationarity. This establishes a general statistical explanation for why the Manifold Hypothesis seems to hold in so many situations. Informed by the Latent Metric Model, we derive procedures to discover and interpret the geometry of high-dimensional data and to explore hypotheses about the data-generating mechanism. These procedures operate under minimal assumptions and make use of well-known, scalable graph-analytic algorithms.