Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model depends heavily on the quality of the data it was trained on, there is growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning, which creates a dual relationship between ML and data cleaning. To the best of our knowledge, no study comprehensively reviews this relationship. Objective: This paper's objectives are twofold. First, it aims to summarize the latest approaches to data cleaning for ML and to ML for data cleaning. Second, it provides recommendations for future work. Method: We conduct a systematic literature review of papers published between 2016 and 2022, inclusive. We identify different types of data cleaning activities with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. Results: We summarize the content of 101 papers covering various data cleaning activities and provide 24 recommendations for future work. Our review highlights many promising data cleaning techniques that can be further extended. Conclusion: We believe that our review of the literature will help the community develop better approaches to clean data.