Many datasets contain missing values for a variety of reasons, which both complicates downstream processing and reduces classification accuracy. The mainstream remedy is missing value imputation, which completes the dataset. Existing imputation methods estimate the missing entries from the observed values in the original feature space and treat all features as equally important during completion, although in practice features differ in importance. We therefore propose an imputation method that takes feature importance into account. The algorithm alternates between matrix completion and feature importance learning; specifically, matrix completion is based on a filling loss that incorporates feature importance. Our experimental analysis covers three types of datasets: synthetic datasets with different noisy features and missing values, real-world datasets with artificially generated missing values, and real-world datasets that originally contain missing values. The results on these datasets consistently show that the proposed method outperforms five existing imputation algorithms. To the best of our knowledge, this is the first work to consider feature importance in the imputation model.
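The alternation between matrix completion and feature importance learning can be illustrated with a minimal sketch. It is not the method evaluated in this work: the abstract does not specify the filling loss or the importance learner, so the sketch assumes a low-rank factorization fitted by alternating least squares on a feature-weighted squared error over the observed entries, and uses the absolute Pearson correlation with a label vector `y` as a stand-in importance score. The names `weighted_impute` and `feature_importance`, and all parameters, are hypothetical.

```python
import numpy as np

def feature_importance(X_filled, y, floor=1e-3):
    """Stand-in importance (an assumption): |Pearson correlation| with y."""
    Xc = X_filled - X_filled.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, 1.0)
    w = np.maximum(corr, floor)        # keep every weight strictly positive
    return w * len(w) / w.sum()        # rescale so the weights average to 1

def weighted_impute(X, y, rank=2, n_outer=5, n_inner=10, lam=0.1, seed=0):
    """Alternate weighted low-rank completion and importance re-estimation.

    X: (n, d) array with np.nan marking missing entries; every column is
       assumed to have at least one observed value.
    y: (n,) numeric labels, used only by the stand-in importance score.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = ~np.isnan(X)                          # True where observed
    n, d = X.shape
    mu = np.nanmean(X, axis=0)                   # observed column means
    X0 = np.where(mask, X - mu, 0.0)             # centered, zeros where missing
    U = rng.normal(scale=0.1, size=(n, rank))    # row factors
    V = rng.normal(scale=0.1, size=(d, rank))    # feature factors
    w = np.ones(d)                               # start with equal importance
    ridge = lam * np.eye(rank)
    for _ in range(n_outer):
        # Step 1: minimize sum over observed (i, j) of w_j * (x_ij - u_i . v_j)^2
        # plus a ridge penalty, by alternating closed-form updates.
        for _ in range(n_inner):
            for i in range(n):
                VtC = V.T * (mask[i] * w)        # weight columns, drop missing
                U[i] = np.linalg.solve(VtC @ V + ridge, VtC @ X0[i])
            for j in range(d):
                UtC = U.T * (mask[:, j] * w[j])
                V[j] = np.linalg.solve(UtC @ U + ridge, UtC @ X0[:, j])
        # Observed entries are kept; only missing entries are filled.
        X_filled = np.where(mask, X, U @ V.T + mu)
        # Step 2: re-estimate feature importance from the completed matrix.
        w = feature_importance(X_filled, y)
    return X_filled, w
```

Calling `weighted_impute(X, y)` returns the completed matrix together with the final weights. Under this weighting, features with higher importance contribute more to the fit of the row factors, while the ridge term pulls the imputed values of low-importance features toward their observed column means.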