No imputation without representation

By filling in missing values, imputation allows datasets to be used with algorithms that cannot handle missing values themselves. However, the fact that a value is missing may in principle carry useful information, and this information is lost through imputation. The missing-indicator approach can be combined with imputation to represent this information as part of the dataset instead. There are several theoretical considerations as to why missing-indicators may or may not be beneficial, but no large-scale practical experiment on real-life datasets has yet tested this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of classification algorithms, on the basis of twenty real-life datasets. In a follow-up experiment, we determine, for each classifier, attribute-specific missingness thresholds above which missing-indicators are more likely than not to increase classification performance. In a second follow-up experiment, we evaluate numerical imputation of one-hot encoded categorical attributes. We reach the following conclusions. Firstly, missing-indicators generally increase classification performance. Secondly, with missing-indicators, nearest neighbour and iterative imputation do not lead to better performance than simple mean/mode imputation. Thirdly, for decision trees, pruning is necessary to prevent overfitting. Fourthly, the thresholds above which missing-indicators are more likely than not to improve performance are lower for categorical attributes than for numerical attributes. Lastly, mean imputation of numerical attributes preserves some of the information from missing values. Consequently, when missing-indicators are not used, it can be advantageous to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
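
The combination of mean imputation with missing-indicators, and the difference between mean and mode imputation of a one-hot encoded attribute, can be illustrated with a minimal sketch. It is not the experimental code behind the results above. The function names `mean_impute` and `one_hot`, the use of `None` to mark a missing value, and the toy data are all assumptions made for illustration, and every column is assumed to contain at least one observed value.

```python
from statistics import mean


def mean_impute(rows, add_indicators=True):
    """Mean-impute each column of `rows` (lists of float or None).

    Hypothetical helper for illustration. If `add_indicators` is True,
    one 0/1 indicator column is appended for every column that contains
    at least one missing value. Assumes no column is entirely missing.
    """
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    has_missing = [any(v is None for v in col) for col in columns]
    result = []
    for row in rows:
        imputed = [col_means[j] if v is None else v for j, v in enumerate(row)]
        if add_indicators:
            # The indicator records that the value was missing, so this
            # fact survives imputation as part of the dataset.
            imputed += [1.0 if v is None else 0.0
                        for j, v in enumerate(row) if has_missing[j]]
        result.append(imputed)
    return result


def one_hot(values, categories):
    """One-hot encode `values`; a missing value (None) stays missing
    in every encoded column. Hypothetical helper for illustration."""
    return [[None] * len(categories) if v is None
            else [1.0 if v == c else 0.0 for c in categories]
            for v in values]


# Numerical attributes: imputed values followed by missing-indicators.
numeric = [[1.0, 10.0], [None, 20.0], [3.0, None]]
with_indicators = mean_impute(numeric)

# Categorical attribute, one-hot encoded, then mean-imputed without
# indicators: the record with the missing value receives fractional
# entries that no observed record has, whereas mode imputation would
# make it identical to an observed "a" record.
encoded = one_hot(["a", "b", None, "a"], categories=["a", "b"])
mean_imputed = mean_impute(encoded, add_indicators=False)
```

In this toy example the second and third records of `with_indicators` carry an indicator value of 1.0 for the attribute that was missing. In `mean_imputed`, the third record is encoded as roughly (0.67, 0.33), which remains distinguishable from every observed record. This is one way to read the statement that mean imputation preserves some of the information from missing values.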
