Data imputation, the process of filling in missing feature values in incomplete data sets, plays a crucial role in data-driven learning. A common premise is that imputation improves learning performance; it follows that the pursuit of better classification can in turn guide the imputation process. While some works use label information to assist in this task, their simplistic use of labels lacks flexibility and may rely on strict assumptions. In this paper, we propose a new framework that leverages supervision information to complete missing data in a manner conducive to classification. The framework operates in two stages. First, it uses labels to supervise the optimization of the similarity relationships among data, represented by the kernel matrix, with the goal of enhancing classification accuracy. To mitigate the overfitting that may occur during this process, a perturbation variable is introduced to improve the robustness of the framework. Second, the learned kernel matrix serves as additional supervision information to guide data imputation through regression, which is solved with the block coordinate descent method. The proposed method is evaluated on four real-world data sets against state-of-the-art imputation methods. Notably, our algorithm significantly outperforms the other methods when more than 60% of the features are missing.
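As a rough illustration of the second stage only, the sketch below shows how a given kernel matrix can guide imputation through regression solved by block coordinate descent. It is a toy stand-in, not the paper's algorithm: it assumes a linear kernel, ignores the diagonal of the kernel matrix, treats each data row as one block, and takes the kernel matrix as already learned. The names `offdiag_loss` and `impute_bcd` are hypothetical, and the synthetic kernel in the demo is built from the complete data purely for demonstration.

```python
import numpy as np


def offdiag_loss(X, K):
    """Squared mismatch between the linear kernel X X^T and K (off-diagonal only)."""
    R = K - X @ X.T
    np.fill_diagonal(R, 0.0)
    return float(np.sum(R ** 2))


def impute_bcd(X_obs, mask, K, n_sweeps=20):
    """Fill entries where mask is False so that X X^T approximates K.

    Each row is one block: with all other rows fixed, the missing entries
    of row i are the solution of an ordinary least-squares regression.
    """
    X = np.where(mask, X_obs, 0.0)            # start with zeros in the gaps
    n = X.shape[0]
    for _ in range(n_sweeps):
        for i in range(n):
            miss = ~mask[i]
            if not miss.any():
                continue                       # nothing to impute in this row
            others = np.arange(n) != i
            A = X[others][:, miss]             # regressors: other rows, missing columns
            # Targets: kernel entries minus the part explained by observed columns.
            b = K[i, others] - X[others][:, mask[i]] @ X[i, mask[i]]
            X[i, miss] = np.linalg.lstsq(A, b, rcond=None)[0]
    return X


# Demo on synthetic data (assumption: K stands in for a learned kernel matrix).
rng = np.random.default_rng(0)
X_true = rng.normal(size=(30, 5))
K = X_true @ X_true.T
mask = rng.random(X_true.shape) > 0.6          # True = observed, roughly 60% missing
X_obs = np.where(mask, X_true, np.nan)
X_hat = impute_bcd(X_obs, mask, K)
```

Because each row update exactly minimizes the terms of the objective that involve that row, the off-diagonal kernel mismatch cannot increase from one update to the next, and observed entries are never modified.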