Feature selection on incomplete datasets is an exceptionally challenging task. Existing methods address it in two steps: they first apply an imputation method to complete the data and then perform feature selection on the imputed data. Because imputation and feature selection are entirely independent steps, the importance of features cannot be taken into account during imputation. In real-world datasets, however, different features have varying degrees of importance. To address this, we propose a novel feature selection framework for incomplete data that considers feature importance. The framework consists mainly of two alternating, iterative stages: the M-stage and the W-stage. In the M-stage, missing values are imputed based on a given feature importance vector and multiple initial imputation results. In the W-stage, an improved ReliefF algorithm is employed to learn the feature importance vector from the imputed data. The feature importance vector obtained in the current iteration of the W-stage then serves as input to the next iteration of the M-stage. Experimental results on both artificially generated and real incomplete datasets demonstrate that the proposed method significantly outperforms other approaches.
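The alternation between the two stages can be illustrated with a minimal sketch. The abstract does not specify the imputation rule or the improvements made to ReliefF, so everything below is an assumption chosen for illustration: the M-stage is approximated by a k-nearest-neighbour fill under an importance-weighted distance, started from the average of two simple initial imputations (column mean and column median), and the W-stage uses a basic single-neighbour Relief update rather than the authors' improved ReliefF. All function names (`initial_imputations`, `m_stage`, `w_stage`, `select_features`) are hypothetical.

```python
import numpy as np

def initial_imputations(X, mask):
    """Assumed initial imputations: column mean and column median.
    Requires every column to have at least one observed value."""
    mean_fill = np.where(mask, np.nanmean(X, axis=0), X)
    median_fill = np.where(mask, np.nanmedian(X, axis=0), X)
    return [mean_fill, median_fill]

def m_stage(X_cur, mask, w, k=3):
    """Assumed M-stage: refill missing entries from the k nearest rows,
    where distance is weighted by the feature importance vector w."""
    X_new = X_cur.copy()
    diff = X_cur[:, None, :] - X_cur[None, :, :]
    dist = np.sqrt((w * diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a row is never its own neighbour
    for i in np.where(mask.any(axis=1))[0]:
        nn = np.argsort(dist[i])[:k]
        X_new[i, mask[i]] = X_cur[nn][:, mask[i]].mean(axis=0)
    return X_new

def w_stage(X, y):
    """Assumed W-stage: basic Relief with one nearest hit and one nearest
    miss per sample (not the improved ReliefF of the paper).
    Requires at least two samples per class and at least two classes."""
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Z = (X - X.min(axis=0)) / span  # scale each feature to [0, 1]
    w = np.zeros(d)
    for i in range(n):
        delta = np.abs(Z - Z[i])
        dist = delta.sum(axis=1)
        dist[i] = np.inf
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        w += delta[miss] - delta[hit]
    w = np.clip(w / n, 0.0, None)  # keep weights non-negative
    total = w.sum()
    return w / total if total > 0 else np.full(d, 1.0 / d)

def select_features(X, y, n_iter=5, k=3):
    """Alternate the two stages; NaN marks a missing entry in X."""
    mask = np.isnan(X)
    X_cur = np.mean(initial_imputations(X, mask), axis=0)
    w = np.full(X.shape[1], 1.0 / X.shape[1])  # start from uniform importance
    for _ in range(n_iter):
        X_cur = m_stage(X_cur, mask, w, k)  # impute using current weights
        w = w_stage(X_cur, y)               # relearn weights from imputed data
    return w, X_cur
```

Calling `select_features(X, y)` on an array with `NaN` entries returns the learned importance vector and the final imputed matrix; features can then be ranked by their weights. Observed entries are never modified, since only positions flagged in `mask` are refilled.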