Missing values are unavoidable when working with data, and they become more frequent as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), a burgeoning literature on missing values has arisen, with heterogeneous aims and motivations. This has led to the development of various methods, formalizations, and tools. Nevertheless, practitioners still find it challenging to decide which method best suits their problem, partly because the topic is not covered systematically in statistics or data science curricula. To help address this challenge, we have launched the "R-miss-tastic" platform, which aims to provide an overview of standard missing values problems, methods, and relevant implementations. Beyond gathering and organizing a large majority of the available material on missing data (bibliography, courses, tutorials, implementations), "R-miss-tastic" covers the development of standardized analysis workflows. Specifically, we have developed several pipelines in R and Python that give hands-on illustrations of, and recommendations on, missing values handling in statistical tasks such as matrix completion, estimation, and prediction, while ensuring that the analyses are reproducible. Finally, the platform is intended for users who analyze incomplete data, researchers who want to compare their methods and search for an up-to-date bibliography, and teachers who are looking for didactic materials (notebooks, videos, slides).