The Manifold Hypothesis is a widely accepted tenet of Machine Learning which asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold embedded in high-dimensional space. This phenomenon is observed empirically in many real-world situations, has led to the development of a wide range of statistical methods over the last few decades, and has been suggested as a key factor in the success of modern AI technologies. We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model -- the Latent Metric Model -- via elementary concepts such as latent variables, correlation and stationarity. This establishes a general statistical explanation for why the Manifold Hypothesis seems to hold in so many situations. Informed by the Latent Metric Model, we derive procedures to discover and interpret the geometry of high-dimensional data and to explore hypotheses about the data-generating mechanism. These procedures operate under minimal assumptions and make use of well-known, scalable graph-analytic algorithms.