Still More Shades of Null: A Benchmark for Responsible Missing Value Imputation

We present Shades-of-NULL, a benchmark for responsible missing value imputation. Our benchmark includes state-of-the-art imputation techniques and embeds them into the machine learning development lifecycle. We model realistic missingness scenarios that go beyond Rubin's classic Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR) to include multi-mechanism missingness (when different missingness patterns co-exist in the data) and missingness shift (when the missingness mechanism changes between training and test). Another key novelty of our work is that we evaluate imputers holistically, based on the predictive performance, fairness and stability of the models that are trained and tested on the data they produce. We use Shades-of-NULL to conduct a large-scale empirical study involving 20,952 experimental pipelines. We find that, while no single imputation approach performs best for all missingness types, interesting performance patterns do emerge when comparing imputer performance in simpler vs. more complex missingness scenarios. Further, while predictive performance, fairness and stability can be seen as orthogonal, we identify trade-offs among them that arise from the combination of the missingness scenario, the choice of imputer, and the architecture of the model trained on the data post-imputation. We make Shades-of-NULL publicly available, in the hope of enabling researchers to comprehensively and rigorously evaluate new missing value imputation methods on a wide range of evaluation metrics, in plausible and socially meaningful missingness scenarios.
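
To make the missingness scenarios named above concrete, the following is a minimal sketch of how MCAR, MAR, MNAR, multi-mechanism missingness and missingness shift could be simulated on a toy table. It is not the Shades-of-NULL implementation: the function names (`inject_nulls`, `mcar`, `mar`, `mnar`), the columns (`age`, `income`, `hours`), the thresholds and the rates are all hypothetical and chosen only for illustration.

```python
import random

def inject_nulls(rows, target, rate, eligible, rng):
    """Return copies of `rows` where `target` is set to None with
    probability `rate` in every row for which `eligible(row)` is true."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        if eligible(row) and rng.random() < rate:
            new_row[target] = None
        out.append(new_row)
    return out

def mcar(rows, target, rate, rng):
    # MCAR: every row is equally likely to lose its value.
    return inject_nulls(rows, target, rate, lambda r: True, rng)

def mar(rows, target, rate, observed, cond, rng):
    # MAR: missingness in `target` depends on another, observed column.
    return inject_nulls(rows, target, rate, lambda r: cond(r[observed]), rng)

def mnar(rows, target, rate, cond, rng):
    # MNAR: missingness in `target` depends on the (hidden) value itself.
    return inject_nulls(
        rows, target, rate,
        lambda r: r[target] is not None and cond(r[target]), rng)

# Hypothetical toy data; column names and ranges are assumptions.
rng = random.Random(0)
rows = [
    {"age": rng.randint(18, 80),
     "income": rng.randint(10, 200),
     "hours": rng.randint(10, 60)}
    for _ in range(1000)
]
train, test = rows[:800], rows[800:]

# Multi-mechanism missingness: MCAR on `hours` co-exists with
# MAR on `income` (older respondents withhold income more often).
train_multi = mar(mcar(train, "hours", 0.1, rng),
                  "income", 0.3, "age", lambda a: a > 60, rng)

# Missingness shift: MCAR at training time, MNAR at test time
# (high incomes are the ones that go missing).
train_shift = mcar(train, "income", 0.3, rng)
test_shift = mnar(test, "income", 0.5, lambda v: v > 150, rng)
```

In this sketch, an imputer would be fitted on `train_shift` and applied to `test_shift`, and the downstream model's accuracy, fairness and stability would then be measured on the imputed test set, mirroring the holistic evaluation the abstract describes.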
