Selecting Features by their Resilience to the Curse of Dimensionality

Real-world datasets are often of high dimension and effected by the curse of dimensionality. This hinders their comprehensibility and interpretability. To reduce the complexity feature selection aims to identify features that are crucial to learn from said data. While measures of relevance and pairwise similarities are commonly used, the curse of dimensionality is rarely incorporated into the process of selecting features. Here we step in with a novel method that identifies the features that allow to discriminate data subsets of different sizes. By adapting recent work on computing intrinsic dimensionalities, our method is able to select the features that can discriminate data and thus weaken the curse of dimensionality. Our experiments show that our method is competitive and commonly outperforms established feature selection methods. Furthermore, we propose an approximation that allows our method to scale to datasets consisting of millions of data points. Our findings suggest that features that discriminate data and are connected to a low intrinsic dimensionality are meaningful for learning procedures.

翻译：现实世界的数据集通常具有高维特性，并受到维度灾难的影响，这阻碍了其可理解性与可解释性。为降低复杂性，特征选择旨在识别从该数据中学习的关键特征。尽管相关性度量与成对相似性方法被广泛使用，但维度灾难很少被纳入特征选择过程。本文提出一种新方法，通过识别能够区分不同规模数据子集的特征来应对这一挑战。通过改进近期关于内在维度计算的研究成果，我们的方法能够筛选出具有数据区分能力的特征，从而削弱维度灾难的影响。实验表明，该方法与现有特征选择方法相比具有竞争力，且通常表现更优。此外，我们提出一种近似策略，使该方法可扩展至包含数百万个数据点的数据集。研究结果表明，具有数据区分能力且与低内在维度相关联的特征对学习过程具有实际意义。

相关内容

特征选择

关注 5940

特征选择( Feature Selection )也称特征子集选择( Feature Subset Selection , FSS )，或属性选择( Attribute Selection )。是指从已有的M个特征(Feature)中选择N个特征使得系统的特定指标最优化，是从原始特征中选择出一些最有效特征以降低数据集维度的过程,是提高学习算法性能的一个重要手段,也是模式识别中关键的数据预处理步骤。对于一个学习算法来说,好的学习样本是训练模型的关键。

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

76+阅读 · 2022年6月28日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

【MIT-ICLR2022】在机器学习模型中注入公平性, Injecting fairness into machine-learning models

专知会员服务

22+阅读 · 2022年3月7日

【Facebook-Ishan Mishra】计算机视觉自监督学习，92页ppt

专知会员服务

36+阅读 · 2021年7月7日