In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function, i.e., the gradient of the log density of a noise-corrupted version of the target distribution, for varying levels of corruption. We prove that, if the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, because this becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model gives us access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space and thus the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first estimator of the data manifold dimension based on diffusion models, and it outperforms well-established statistical estimators in controlled experiments on both Euclidean and image data.
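The idea can be illustrated with a small numerical sketch. The abstract does not spell out the estimation procedure, so everything below is an assumption made for illustration: the trained diffusion model is replaced by the exact score of a Gaussian-smoothed empirical distribution, the normal space is read off from the singular values of a matrix of score vectors sampled around one point, and its dimension is chosen by the largest spectral gap. The names `smoothed_empirical_score` and `estimate_intrinsic_dim` are hypothetical.

```python
import numpy as np

def smoothed_empirical_score(x, data, sigma):
    """Score of the data set convolved with N(0, sigma^2 I).

    Stand-in (assumption) for a trained diffusion model's score network.
    x: (m, d) query points, data: (n, d) samples on the manifold.
    """
    # Pairwise squared distances between queries and data, shape (m, n).
    d2 = (x**2).sum(1)[:, None] - 2.0 * x @ data.T + (data**2).sum(1)[None, :]
    logits = -d2 / (2.0 * sigma**2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    # Gradient of the log of a Gaussian mixture centred on the data points.
    return (w @ data - x) / sigma**2

def estimate_intrinsic_dim(score_fn, x0, sigma, n_samples, rng):
    """Estimate the manifold dimension near x0 from score vectors.

    Assumed rule: scores at small noise lie close to the normal space, so
    the number of dominant singular values approximates its dimension.
    """
    d = x0.shape[0]
    # Perturb x0 with the same noise level at which the score is evaluated.
    x = x0 + sigma * rng.standard_normal((n_samples, d))
    scores = score_fn(x)                                # (n_samples, d)
    sv = np.linalg.svd(scores, compute_uv=False)        # descending order
    gaps = sv[:-1] - sv[1:]
    normal_dim = int(np.argmax(gaps)) + 1               # largest spectral gap
    return d - normal_dim, sv

# Toy data: a unit 2-sphere placed in the first 3 coordinates of R^10.
rng = np.random.default_rng(0)
ambient_dim, n_data, sigma = 10, 20000, 0.05
pts = rng.standard_normal((n_data, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
data = np.zeros((n_data, ambient_dim))
data[:, :3] = pts

k_hat, sv = estimate_intrinsic_dim(
    lambda x: smoothed_empirical_score(x, data, sigma),
    data[0], sigma, 200, rng,
)
print("estimated intrinsic dimension:", k_hat)
print("singular values:", np.round(sv, 1))
```

In this toy setup the sphere has intrinsic dimension 2 and the normal space has dimension 8, so the spectrum should show eight large singular values followed by a sharp drop. With a real diffusion model, `score_fn` would be the network evaluated at a small noise level, and the choice of noise level and gap criterion would need care.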