Missing data is a fundamental challenge in data science that hinders analysis and decision-making across a wide range of disciplines, including healthcare, bioinformatics, social science, e-commerce, and industrial monitoring. Despite decades of research and numerous imputation methods, the literature remains fragmented across fields, creating a need for a comprehensive synthesis that connects statistical foundations with modern machine learning advances. This work systematically reviews core concepts (including missingness mechanisms, single versus multiple imputation, and different imputation goals) and examines problem characteristics across domains. It provides a categorization of imputation methods, ranging from classical techniques (e.g., regression, the EM algorithm) to modern approaches such as low-rank and high-rank matrix completion, deep learning models (autoencoders, GANs, diffusion models, graph neural networks), and large language models. Special attention is given to methods for complex data types, such as tensors, time series, streaming data, graph-structured data, categorical data, and multimodal data. Beyond methodology, we investigate the integration of imputation with downstream tasks such as classification, clustering, and anomaly detection, examining both sequential pipelines and joint optimization frameworks. We also assess theoretical guarantees, benchmarking resources, and evaluation metrics. Finally, we identify open challenges and future directions, emphasizing model selection and hyperparameter optimization, the growing importance of privacy-preserving imputation via federated learning, and the pursuit of generalizable models that can adapt across domains and data types, thereby outlining a roadmap for future research.