Real-world datasets often contain missing data, and repairing them takes significant time and effort before accurate models can be learned. In this paper, we show that imputing all missing values is not always necessary to obtain an accurate ML model. We introduce the concepts of minimal and almost minimal repairs: subsets of the missing data items in the training data whose imputation yields accurate and reasonably accurate models, respectively. Imputing only these subsets can significantly reduce the time, computational resources, and manual effort required for learning. We show that finding these subsets is NP-hard for some popular models, and we propose efficient approximation algorithms for a wide range of models. Our extensive experiments indicate that the proposed algorithms can substantially reduce the time and effort required to learn on incomplete datasets.
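To make the central claim concrete, the following toy sketch checks whether a single missing value needs to be imputed at all. It is not the paper's algorithm: the 1-D max-margin threshold classifier, the names `max_margin_threshold` and `needs_imputation`, the known value range (`domain`), and the grid-based check are all assumptions chosen for illustration, and the sketch assumes exactly one missing value.

```python
from typing import List, Optional, Tuple

# (feature value, or None if missing; label in {-1, +1})
Point = Tuple[Optional[float], int]


def max_margin_threshold(xs: List[float], ys: List[int]) -> float:
    """Max-margin threshold for 1-D data with negatives left of positives."""
    max_neg = max(x for x, y in zip(xs, ys) if y == -1)
    min_pos = min(x for x, y in zip(xs, ys) if y == +1)
    if max_neg >= min_pos:
        raise ValueError("data are not separable by a single threshold")
    return (max_neg + min_pos) / 2.0


def needs_imputation(data: List[Point],
                     domain: Tuple[float, float],
                     steps: int = 50) -> bool:
    """Return True if the learned threshold depends on the filled-in value.

    Hypothetical brute-force check: try evenly spaced candidate values from
    the assumed domain of the missing feature and compare the resulting models.
    """
    lo, hi = domain
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [y for _, y in data]
    thresholds = set()
    for c in candidates:
        xs = [c if x is None else x for x, _ in data]
        thresholds.add(max_margin_threshold(xs, ys))
    return len(thresholds) > 1


# One negative example has a missing feature value.
data: List[Point] = [(1.0, -1), (2.0, -1), (None, -1), (5.0, +1), (6.0, +1)]

# The missing value can only lie in [0, 1.5], behind the negative point at 2.0,
# so every imputation gives the same threshold: no repair is needed.
print(needs_imputation(data, (0.0, 1.5)))  # False

# The missing value may lie in [0, 4]; values above 2.0 move the threshold,
# so this item would have to be imputed.
print(needs_imputation(data, (0.0, 4.0)))  # True
```

In the first case the set of items that must be imputed is empty, which is the kind of saving the abstract describes. Enumerating candidate values does not scale to many missing items, which is consistent with the need for the approximation algorithms mentioned in the abstract.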