Characteristics of data, such as distribution and heterogeneity, become more complex and counterintuitive as dimensionality increases. This phenomenon is known as the curse of dimensionality: common patterns and relationships (e.g., internal and boundary patterns) that hold in low-dimensional space may no longer hold in higher-dimensional space. As a result, the performance of regression, classification, and clustering models and algorithms degrades. The curse of dimensionality can be attributed to many causes. In this paper, we first summarize five challenges associated with manipulating high-dimensional data and explain the potential causes for the failure of regression, classification, and clustering tasks. We then examine two major causes of the curse of dimensionality, distance concentration and the manifold effect, through theoretical and empirical analyses. The results demonstrate that nearest neighbor search (NNS) using three typical distance measures, namely Minkowski distance, Chebyshev distance, and cosine distance, becomes meaningless as dimensionality increases. Meanwhile, the data incorporate more redundant features, and the variance contribution in principal component analysis (PCA) is skewed towards a few dimensions. By interpreting the causes of the curse of dimensionality, we can better understand the limitations of current models and algorithms and work towards improving the performance of data analysis and machine learning tasks in high-dimensional space.
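The distance concentration effect mentioned above can be illustrated with a minimal sketch. It is not the paper's experimental setup: it assumes points drawn uniformly at random from the unit hypercube, and it uses the relative contrast (d_max - d_min) / d_min between a query point and a sample as the diagnostic. The function name `relative_contrast` and its parameters are hypothetical and chosen for illustration only. Chebyshev distance is handled as the p = infinity case of the Minkowski distance.

```python
import numpy as np


def relative_contrast(n_points, dim, p=2.0, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption: data and query are i.i.d. uniform on [0, 1]^dim.
    p is the Minkowski order; p = np.inf gives the Chebyshev distance.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))   # sample of candidate neighbors
    query = rng.random(dim)              # single query point
    diff = np.abs(data - query)          # per-coordinate absolute differences
    if np.isinf(p):
        dists = diff.max(axis=1)         # Chebyshev: largest coordinate gap
    else:
        dists = (diff ** p).sum(axis=1) ** (1.0 / p)  # Minkowski of order p
    return (dists.max() - dists.min()) / dists.min()


if __name__ == "__main__":
    # A contrast near zero means the nearest and farthest points are
    # almost equally far away, so "nearest" carries little information.
    for dim in (2, 10, 100, 1000):
        euclid = relative_contrast(1000, dim, p=2.0)
        cheby = relative_contrast(1000, dim, p=np.inf)
        print(f"dim={dim:5d}  euclidean={euclid:10.3f}  chebyshev={cheby:10.3f}")
```

Under these assumptions the relative contrast shrinks as the dimension grows, which is the sense in which nearest neighbor search loses discriminative power. Real data sets with low intrinsic dimensionality may behave differently.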