When processing high-dimensional datasets, a common pre-processing step is feature selection. Filter-based feature selection algorithms are not tailored to a specific classification method; instead, they rank each feature by its relevance to the target and the task. This work focuses on a graph-based filter feature selection method suited to multi-class classification tasks. We aim to drastically reduce the number of selected features in order to create a sketch of the original data that encodes valuable information for the classification task. The proposed graph-based algorithm is constructed by combining the Jeffries-Matusita distance with a non-linear dimension reduction method, diffusion maps. Feature elimination is performed based on the distribution of the features in the low-dimensional space. Then, a very small number of features with complementary separation strengths are selected. Moreover, the low-dimensional embedding allows the feature space to be visualized. Experimental results are provided for public datasets and compared with known filter-based feature selection techniques.
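The combination of the Jeffries-Matusita (JM) distance with diffusion maps can be illustrated with a minimal sketch. It is not the authors' implementation; it rests on several assumptions that the abstract does not state: each feature is modeled as a univariate Gaussian within each class, the JM distance uses the convention JM = 2(1 - exp(-B)) with B the Bhattacharyya distance, each feature is described by its vector of JM distances over all class pairs, and the kernel scale is set to the mean squared distance between features. The function names `jm_feature_matrix` and `diffusion_map` are hypothetical, and the elimination and selection steps are omitted because the abstract does not specify them.

```python
import numpy as np
from itertools import combinations


def jm_feature_matrix(X, y, eps=1e-12):
    """Return an (n_features, n_class_pairs) matrix of JM distances.

    Assumption: every feature is a univariate Gaussian within each class.
    """
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    columns = []
    for a, b in combinations(range(len(classes)), 2):
        var_sum = variances[a] + variances[b]
        # Bhattacharyya distance between two univariate Gaussians.
        bhatt = (0.25 * (means[a] - means[b]) ** 2 / var_sum
                 + 0.5 * np.log(var_sum / (2.0 * np.sqrt(variances[a] * variances[b]))))
        # JM distance, bounded above by 2 under this convention.
        columns.append(2.0 * (1.0 - np.exp(-bhatt)))
    return np.column_stack(columns)


def diffusion_map(points, n_components=2, t=1):
    """Embed the rows of `points` with a basic diffusion map."""
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    epsilon = sq[sq > 0].mean()  # kernel scale (an assumed heuristic)
    K = np.exp(-sq / epsilon)
    d = K.sum(axis=1)
    # Symmetric matrix sharing its eigenvalues with the Markov matrix P = D^-1 K.
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]  # right eigenvectors of P
    # Drop the trivial first eigenvector and scale by eigenvalue^t.
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t


# Synthetic example: 6 features, 3 classes, only feature 0 is informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 6))
X[:, 0] += 3.0 * y

jm = jm_feature_matrix(X, y)   # one row per feature, one column per class pair
embedding = diffusion_map(jm)  # low-dimensional coordinates of the features
```

In this sketch each row of `embedding` is one feature, so a scatter plot of its two columns gives the kind of feature-space visualization the abstract mentions. Features with similar class-pair separation profiles land close together, which is what makes elimination based on their distribution in the embedded space plausible.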