Feature Selection in High-dimensional Space Using Graph-Based Methods

High-dimensional feature selection is a central problem in a variety of application domains such as machine learning, image analysis, and genomics. In this paper, we propose graph-based tests as a useful basis for feature selection. We describe an algorithm for selecting informative features in high-dimensional data, where each observation comes from one of $K$ different distributions. Our algorithm can be applied in a completely nonparametric setup, without any distributional assumptions on the data, and it aims to output those features that contribute the most to the overall distributional variation. At the heart of our method is the recursive application of distribution-free graph-based tests to subsets of the feature set, located at different depths of a hierarchical clustering tree constructed from the data. Our algorithm recovers all truly contributing features with high probability, while ensuring optimal control of false discoveries. Finally, we show on synthetic data that our method outperforms existing ones, and we demonstrate its utility on two real-life datasets from the domains of climate change and single-cell transcriptomics.
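The recursive idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption and not taken from the paper: the test is a permutation version of a minimum-spanning-tree edge-count test, the tree is built with average linkage on Euclidean distances between feature columns, and the names `mst_edge_test` and `select_features` are hypothetical. The sketch prunes a subtree as soon as its feature subset shows no detectable difference between groups, and it applies no multiplicity correction, so it does not reproduce the false-discovery guarantees claimed above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_edge_test(X, labels, n_perm=200, rng=None):
    """Permutation p-value for an MST edge-count test (illustrative choice).

    Few MST edges joining observations with different labels suggests that
    the groups occupy different regions, i.e. the distributions differ.
    Assumes no duplicate rows, since zero distances are read as "no edge".
    """
    rng = np.random.default_rng(rng)
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).tocoo()
    rows, cols = mst.row, mst.col
    observed = np.sum(labels[rows] != labels[cols])
    # The MST depends only on X, so only the labels are permuted.
    perm_counts = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(labels)
        perm_counts[b] = np.sum(perm[rows] != perm[cols])
    return (1 + np.sum(perm_counts <= observed)) / (1 + n_perm)


def select_features(X, labels, alpha=0.05, n_perm=200, seed=0):
    """Descend a hierarchical clustering tree of the features.

    A node is tested on the columns below it. If the test does not reject,
    the whole subtree is dropped; if it rejects at a leaf, that feature is
    selected; otherwise both children are visited.
    """
    rng = np.random.default_rng(seed)
    # Cluster the features (columns), hence the transpose.
    root = to_tree(linkage(X.T, method="average"))
    selected = []

    def visit(node):
        features = node.pre_order()  # leaf ids are column indices
        p_value = mst_edge_test(X[:, features], labels, n_perm, rng)
        if p_value > alpha:
            return
        if node.is_leaf():
            selected.append(node.id)
        else:
            visit(node.get_left())
            visit(node.get_right())

    visit(root)
    return sorted(selected)
```

Calling `select_features(X, labels)` with an `(n, p)` data matrix and an integer label vector of length `n` returns the indices of the columns retained by this sketch. The early pruning is what keeps the number of tests small when most features are uninformative.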


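Feature Selection

Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N features from an existing set of M features so that a specified performance criterion of the system is optimized. It is the process of picking the most effective of the original features in order to reduce the dimensionality of the dataset. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are key to training the model.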
