Still More Shades of Null: An Evaluation Suite for Responsible Missing Value Imputation

Data missingness is a practical challenge of sustained interest to the scientific community. In this paper, we present Shades-of-Null, an evaluation suite for responsible missing value imputation. Our work is novel in two ways: (i) we model realistic and socially salient missingness scenarios that go beyond Rubin's classic Missing Completely at Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR) settings, to include multi-mechanism missingness (when different missingness patterns co-exist in the data) and missingness shift (when the missingness mechanism changes between training and test); and (ii) we evaluate imputers holistically, based on imputation quality as well as on the predictive performance, fairness and stability of the models that are trained and tested on the data post-imputation. We use Shades-of-Null to conduct a large-scale empirical study involving 23,940 experimental pipelines. We find that no single imputation approach performs best for all missingness types, and that interesting trade-offs arise between predictive performance, fairness and stability, depending on the combination of missingness scenario, imputer choice, and the architecture of the predictive model. We make Shades-of-Null publicly available, to enable researchers to rigorously evaluate missing value imputation methods on a wide range of metrics in plausible and socially meaningful scenarios.
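To make the missingness mechanisms named in the abstract concrete, the following is a minimal sketch of how nulls might be injected under MCAR, MAR and MNAR, and how a missingness shift between training and test could be simulated. It is not the Shades-of-Null API: the function names, the median-based thresholds, the synthetic `age` and `income` columns, and the 30% rate are all assumptions chosen for illustration.

```python
import numpy as np

def inject_nulls(x, drop_prob, rng):
    """Return a float copy of x with entries set to NaN with probability drop_prob."""
    out = np.asarray(x, dtype=float).copy()
    out[rng.random(out.shape[0]) < drop_prob] = np.nan
    return out

def mcar(x, rate, rng):
    # MCAR: the same drop probability for every row.
    return inject_nulls(x, np.full(len(x), rate), rng)

def mar(x, observed, rate, rng):
    # MAR: drop probability depends only on a different, fully observed column.
    # Illustrative rule: only rows above the median of `observed` can go missing.
    high = observed > np.median(observed)
    return inject_nulls(x, np.where(high, min(1.0, 2 * rate), 0.0), rng)

def mnar(x, rate, rng):
    # MNAR: drop probability depends on the value of x itself.
    # Illustrative rule: only values above the median of x can go missing.
    high = x > np.median(x)
    return inject_nulls(x, np.where(high, min(1.0, 2 * rate), 0.0), rng)

# Synthetic data (hypothetical columns, not taken from the paper).
rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=1000)
income = rng.normal(50_000, 15_000, size=1000)

# Missingness shift: the mechanism differs between training and test.
train_income = mcar(income, 0.3, rng)  # training data: MCAR
test_income = mnar(income, 0.3, rng)   # test data: MNAR

# MAR example: income goes missing depending on age, not on income itself.
mar_income = mar(income, age, 0.3, rng)
```

Under these assumptions, multi-mechanism missingness could be approximated by applying different injectors to different columns, or to disjoint row subsets, of the same dataset.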

