The characteristics and interpretability of data become more abstract and complex as dimensionality increases. Common patterns and relationships that hold in low-dimensional space may fail to hold in higher-dimensional space. This phenomenon, known as the curse of dimensionality, degrades the performance of regression, classification, and clustering models and algorithms. The curse of dimensionality can be attributed to many causes. In this paper, we first summarize five challenges associated with manipulating high-dimensional data and explain the potential causes of failure in regression, classification, and clustering tasks. Subsequently, we delve into two major causes of the curse of dimensionality, distance concentration and the manifold effect, through theoretical and empirical analyses. The results demonstrate that nearest neighbor search (NNS) using three typical distance measures, Minkowski distance, Chebyshev distance, and cosine distance, becomes meaningless as the dimensionality increases. Meanwhile, the data incorporate more redundant features, and the variance contribution in principal component analysis (PCA) is skewed towards a few dimensions. By interpreting the causes of the curse of dimensionality, we can better understand the limitations of current models and algorithms and work to improve the performance of data analysis and machine learning tasks in high-dimensional space.
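The distance-concentration claim can be illustrated with a small numerical sketch. The code below is not the authors' experiment: it assumes points drawn uniformly from the unit hypercube, takes Euclidean distance as the p = 2 instance of Minkowski distance, and uses a hypothetical helper named relative_contrast that reports (d_max - d_min) / d_min from one random query to the sampled points. A value near zero would suggest that the nearest and farthest neighbors are almost equally far away, which is the sense in which NNS loses meaning.

```python
import numpy as np


def relative_contrast(dim, n_points=1000, seed=0):
    """Return (d_max - d_min) / d_min for three distance measures.

    Assumption for illustration only: data and query are sampled
    uniformly from [0, 1]^dim. Use dim >= 2, since cosine distance
    is identically zero for positive scalars.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))
    query = rng.random(dim)
    diff = data - query

    # Cosine similarity between each data point and the query.
    cos_sim = (data @ query) / (
        np.linalg.norm(data, axis=1) * np.linalg.norm(query)
    )

    distances = {
        "euclidean": np.linalg.norm(diff, ord=2, axis=1),  # Minkowski, p = 2
        "chebyshev": np.abs(diff).max(axis=1),             # Minkowski, p -> infinity
        "cosine": 1.0 - cos_sim,
    }
    return {
        name: float((d.max() - d.min()) / d.min())
        for name, d in distances.items()
    }


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        contrast = relative_contrast(dim)
        print(dim, {name: round(value, 3) for name, value in contrast.items()})
```

Under these assumptions, the printed contrast should shrink for all three measures as dim grows. Other data distributions, sample sizes, or values of p may concentrate at different rates.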