The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it provides a lower bound on the number of variables needed to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Typically, at small scales the ID is very large because the data are affected by measurement noise; at large scales it can also be erroneously large, owing to the curvature and topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the range of scales in which the ID is meaningful and useful. The protocol is based on requiring that, for distances smaller than the correct scale, the density of the data is constant. In the presented framework, estimating the density requires knowing the ID, so this condition is imposed self-consistently. We illustrate the usefulness and robustness of the procedure with benchmarks on artificial and real-world datasets.
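As background for the scale-dependence discussed above, the ID at the smallest available scale is often measured with a nearest-neighbour estimator such as Two-NN, which uses only the ratio of each point's second to first neighbour distance. The sketch below is an illustrative implementation under that assumption (the abstract does not specify the estimator, and the function name `twonn_id` is hypothetical), not the paper's full protocol:

```python
import numpy as np

def twonn_id(X):
    """Two-NN maximum-likelihood estimate of the intrinsic dimension.

    Uses mu_i = r2_i / r1_i, the ratio of the second to the first
    nearest-neighbour distance of each point; the MLE is N / sum(log mu_i).
    """
    # Pairwise squared Euclidean distances (fine for small N).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)              # exclude self-distances
    r = np.sqrt(np.sort(d2, axis=1)[:, :2])   # first and second NN distances
    mu = r[:, 1] / r[:, 0]
    return len(X) / np.sum(np.log(mu))

# Toy example: a 2-D plane linearly embedded in 3-D.
rng = np.random.default_rng(0)
plane = rng.uniform(size=(1000, 2))
X = np.column_stack([plane, plane.sum(axis=1)])
print(twonn_id(X))  # close to 2 for a clean 2-D manifold
```

Adding measurement noise to `X` inflates this small-scale estimate toward the embedding dimension, which is precisely the regime the abstract's sweet-spot selection is meant to exclude.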