A novel feature selection framework for incomplete data

Feature selection on incomplete datasets is an exceptionally challenging task. Existing methods address it in two steps: they first impute the missing values and then perform feature selection on the imputed data. Because imputation and feature selection are entirely independent steps, feature importance cannot be taken into account during imputation, even though in real-world datasets different features have varying degrees of importance. To address this, we propose a novel feature selection framework for incomplete data that considers feature importance. The framework consists mainly of two alternating iterative stages, the M-stage and the W-stage. In the M-stage, missing values are imputed based on a given feature importance vector and multiple initial imputation results. In the W-stage, an improved ReliefF algorithm is employed to learn the feature importance vector from the imputed data. The feature importance vector obtained in the current W-stage iteration then serves as input to the next M-stage iteration. Experimental results on both artificially generated and real incomplete datasets demonstrate that the proposed method significantly outperforms other approaches.
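The alternating structure described in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rules, so everything concrete here is an assumption made for illustration: missing entries are encoded as NaN, class labels `y` are available, the initial imputations are column mean and column median, the M-stage fills gaps from the `k` nearest neighbours under an importance-weighted distance, and the W-stage uses a basic Relief update (one nearest hit, one nearest miss) in place of the paper's improved ReliefF. The function names `initial_imputations`, `m_stage`, `w_stage`, and `fit` are hypothetical.

```python
import numpy as np

def initial_imputations(X, mask):
    """Assumed initial imputations: column mean and column median."""
    mean_fill = np.where(mask, np.nanmean(X, axis=0), X)
    median_fill = np.where(mask, np.nanmedian(X, axis=0), X)
    return [mean_fill, median_fill]

def m_stage(X, mask, w, candidates, k=3):
    """Impute missing entries using importance weights w (assumed rule)."""
    base = np.mean(candidates, axis=0)      # combine the initial imputations
    out = np.where(mask, base, X)           # observed entries stay unchanged
    for i in np.flatnonzero(mask.any(axis=1)):
        # Importance-weighted squared distance from sample i to all samples.
        dist = ((base - base[i]) ** 2 * w).sum(axis=1)
        dist[i] = np.inf                    # exclude the sample itself
        nn = np.argsort(dist)[:k]
        out[i, mask[i]] = base[nn][:, mask[i]].mean(axis=0)
    return out

def w_stage(X, y):
    """Basic Relief weights (a stand-in for the improved ReliefF).

    Assumes every class has at least two samples.
    """
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                   # avoid division by zero
    Z = (X - X.min(axis=0)) / span          # scale features to [0, 1]
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(Z - Z[i]).sum(axis=1)
        dist[i] = np.inf
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class
        w += np.abs(Z[i] - Z[miss]) - np.abs(Z[i] - Z[hit])
    w = np.clip(w / n, 0.0, None)           # keep non-negative weights
    total = w.sum()
    return w / total if total > 0 else np.full(d, 1.0 / d)

def fit(X, y, n_iter=5, k=3):
    """Alternate M-stage and W-stage for a fixed number of iterations."""
    mask = np.isnan(X)
    candidates = initial_imputations(X, mask)
    w = np.full(X.shape[1], 1.0 / X.shape[1])   # start from uniform importance
    for _ in range(n_iter):
        X_imp = m_stage(X, mask, w, candidates, k)
        w = w_stage(X_imp, y)               # feeds the next M-stage
    return X_imp, w
```

Calling `fit(X, y)` on a NaN-masked array returns the imputed matrix and a normalized importance vector. A fixed iteration count is used here only for simplicity; the abstract does not state a stopping criterion.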

Related content

Feature selection

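Feature selection, also called feature subset selection (FSS) or attribute selection, is the process of selecting N features from M existing features so that a specific criterion of the system is optimized. It picks the most effective features from the original set in order to reduce the dimensionality of the dataset. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are the key to training a model.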
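As a small illustration of selecting N features out of M, the sketch below keeps the N columns with the highest scores. It assumes a score per feature is already available (for example, an importance vector); the helper name `select_top_n` is hypothetical, and ranking by a single score is only one of many possible selection criteria.

```python
import numpy as np

def select_top_n(X, scores, n):
    """Keep the n columns of X with the highest scores.

    Returns the reduced matrix and the kept column indices in ascending order.
    """
    order = np.argsort(scores)[::-1]     # feature indices, best score first
    kept = np.sort(order[:n])            # preserve original column order
    return X[:, kept], kept

X = np.arange(12).reshape(3, 4)          # 3 samples, M = 4 features
scores = np.array([0.10, 0.70, 0.05, 0.15])
X_sub, kept = select_top_n(X, scores, n=2)
```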
