A Geometric View of Data Complexity: Efficient Local Intrinsic Dimension Estimation with Diffusion Models

High-dimensional data commonly lies on low-dimensional submanifolds, and estimating the local intrinsic dimension (LID) of a datum -- i.e. the dimension of the submanifold it belongs to -- is a longstanding problem. LID can be understood as the number of local factors of variation: the more factors of variation a datum has, the more complex it tends to be. Estimating this quantity has proven useful in contexts ranging from generalization in neural networks to detection of out-of-distribution data, adversarial examples, and AI-generated text. The recent successes of deep generative models present an opportunity to leverage them for LID estimation, but current methods based on generative models produce inaccurate estimates, require more than a single pre-trained model, are computationally intensive, or do not exploit the best available deep generative models: diffusion models (DMs). In this work, we show that the Fokker-Planck equation associated with a DM can provide an LID estimator which addresses the aforementioned deficiencies. Our estimator, called FLIPD, is easy to implement and compatible with all popular DMs. Applying FLIPD to synthetic LID estimation benchmarks, we find that DMs implemented as fully-connected networks are highly effective LID estimators that outperform existing baselines. We also apply FLIPD to natural images where the true LID is unknown. Despite being sensitive to the choice of network architecture, FLIPD estimates remain a useful measure of relative complexity; compared to competing estimators, FLIPD exhibits a consistently higher correlation with image PNG compression rate and better aligns with qualitative assessments of complexity. Notably, FLIPD is orders of magnitude faster than other LID estimators, and the first to be tractable at the scale of Stable Diffusion.
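
To make the link between noise-perturbed densities and LID concrete, the following is a minimal toy sketch, not the FLIPD implementation. The abstract does not state the estimator's formula. The expression used here, D + sigma^2 * (div(score) + ||score||^2), is an assumption for illustration, as are the function name `toy_lid_estimate` and the toy data distribution (a standard Gaussian on the first d axes of R^D). It uses a closed-form score in place of a trained diffusion model.

```python
import numpy as np

def toy_lid_estimate(x, d, sigma):
    """Toy LID estimate at x for data ~ N(0, I_d) on the first d axes of R^D.

    After adding Gaussian noise of scale sigma, the density is
    N(0, diag(1 + sigma^2 [d times], sigma^2 [D - d times])), so the score
    and its divergence are available in closed form (no network needed).
    """
    x = np.asarray(x, dtype=float)
    D = x.shape[0]
    var = np.concatenate([np.full(d, 1.0 + sigma**2), np.full(D - d, sigma**2)])
    score = -x / var                 # gradient of the log density at x
    div_score = -np.sum(1.0 / var)   # trace of the score Jacobian
    # Assumed estimator form (hypothetical, for illustration only):
    return D + sigma**2 * (div_score + score @ score)

D, d = 10, 2                         # ambient and intrinsic dimension
x = np.zeros(D)
x[:d] = [0.5, -1.0]                  # a point lying on the 2-D subspace
for sigma in (1.0, 0.1, 0.01):
    print(sigma, toy_lid_estimate(x, d, sigma))
```

In this toy setting the estimate approaches the intrinsic dimension d = 2 as sigma shrinks, while larger noise scales blur the subspace and bias the estimate. With a real diffusion model, the closed-form score would be replaced by the learned score network.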
