In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function, i.e., the gradient of the log density of a noise-corrupted version of the target distribution, for varying levels of corruption. If the data concentrates around a manifold embedded in a high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, since this becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model gives us access to an approximation of the normal bundle of the data manifold. Because the normal and tangent spaces at a point are complementary in the ambient space, this allows us to estimate the dimension of the tangent space and thus the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning-based estimator of the data manifold dimension, and it outperforms well-established statistical estimators in controlled experiments on both Euclidean and image data.
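The idea can be illustrated with a minimal sketch, which is not the implementation evaluated in this work. It makes several assumptions that do not come from the text: the trained diffusion model is replaced by the closed-form score of a toy distribution (a standard Gaussian supported on a `k`-dimensional linear subspace of `R^d`, corrupted with Gaussian noise of scale `sigma`), and the normal dimension is read off as the position of the largest drop in the singular value spectrum of the collected score vectors, which is one simple heuristic. The names `toy_score` and `estimate_intrinsic_dim` are hypothetical.

```python
import numpy as np


def toy_score(x, basis, sigma):
    """Exact score of N(0, basis @ basis.T + sigma^2 * I).

    Assumption: `basis` is a (d, k) matrix with orthonormal columns, so the
    covariance inverse splits into a tangent part and a normal part.
    """
    tangent = x @ basis @ basis.T      # projection onto the subspace
    normal = x - tangent               # component orthogonal to the subspace
    return -normal / sigma**2 - tangent / (1.0 + sigma**2)


def estimate_intrinsic_dim(score_fn, x0, sigma, n_samples, rng):
    """Estimate the intrinsic dimension at x0 from score vectors near x0."""
    d = x0.shape[0]
    # Perturb x0 with a small amount of noise and evaluate the score there.
    noisy = x0 + sigma * rng.standard_normal((n_samples, d))
    scores = score_fn(noisy)           # shape (n_samples, d)
    # At small sigma the scores are dominated by their normal components,
    # so the large singular values correspond to normal directions.
    s = np.linalg.svd(scores, compute_uv=False)   # sorted in descending order
    normal_dim = int(np.argmax(s[:-1] - s[1:])) + 1   # largest spectral drop
    return d - normal_dim


rng = np.random.default_rng(0)
d, k, sigma = 10, 3, 1e-3              # toy values chosen for illustration
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal (d, k) basis
x0 = basis @ rng.standard_normal(k)    # a point on the toy manifold
est = estimate_intrinsic_dim(
    lambda x: toy_score(x, basis, sigma), x0, sigma, n_samples=500, rng=rng
)
print("estimated intrinsic dimension:", est)
```

In this toy setting the normal components of the score scale like `1 / sigma` while the tangential components stay of order one, which is what produces the gap in the spectrum. With a real diffusion model, `score_fn` would instead wrap the trained network evaluated at a small noise level, and the gap would be less clean.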