The intrinsic dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound on the number of variables necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Typically, at small scales the ID is very large, because the data are affected by measurement errors. At large scales, the ID can also appear erroneously large, due to the curvature and topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the range of scales in which the ID is meaningful and useful. The protocol imposes that the density of the data is constant for distances smaller than the correct scale. In the presented framework, estimating the density requires knowing the ID, so this condition is imposed self-consistently. We illustrate the usefulness of this procedure, and its robustness to noise, with benchmarks on artificial and real-world datasets.
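The scale dependence described above can be illustrated with a small numerical sketch. It is not the protocol introduced in this work: it uses a simple two-nearest-neighbour ratio estimator as a stand-in, and changes the scale by decimating the dataset. The function name `twonn_id`, the synthetic dataset (a noisy 2D square embedded in 5 dimensions), and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def twonn_id(points):
    """Two-nearest-neighbour ID estimate (illustrative stand-in estimator).

    Assumes the density is locally constant, so that the ratio mu = r2 / r1 of
    the second to the first nearest-neighbour distance is Pareto distributed
    with exponent equal to the ID. The maximum-likelihood estimate is then
    N / sum(log(mu)).
    """
    n = len(points)
    # Squared pairwise distances via the Gram matrix (fine for a few thousand points).
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (points @ points.T)
    d2 = np.maximum(d2, 0.0)          # clip tiny negative values from round-off
    np.fill_diagonal(d2, np.inf)      # exclude each point from its own neighbours
    # Columns 0 and 1 hold the smallest and second-smallest squared distances.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    mu = nearest[:, 1] / nearest[:, 0]
    return n / np.sum(np.log(mu))


# Hypothetical dataset: a 2D unit square embedded in 5D, plus isotropic noise.
rng = np.random.default_rng(0)
n_points, ambient_dim, noise_sigma = 2000, 5, 0.01
data = np.zeros((n_points, ambient_dim))
data[:, :2] = rng.uniform(size=(n_points, 2))
data += noise_sigma * rng.normal(size=data.shape)

# Small scale: all points, so neighbour distances are comparable to the noise.
id_full = twonn_id(data)
# Larger scale: keep every 16th point, so neighbour distances exceed the noise.
id_decimated = twonn_id(data[::16])

print(f"ID estimate, full dataset:      {id_full:.2f}")
print(f"ID estimate, decimated dataset: {id_decimated:.2f}")
```

In this setup one would expect the estimate on the full dataset to be inflated by the noise, and the estimate on the decimated dataset to move closer to the true value of 2. Choosing the decimation level by hand, as done here, is exactly the step that an automatic scale-selection protocol is meant to replace.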