Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from generative models. The study of repairing erroneous data has therefore gained significant importance. Existing data repair algorithms differ in the information they use and in their problem settings, and they have been tested only in limited scenarios. In this paper, we first compare and summarize these algorithms using a new taxonomy based on guided information. We then conduct a systematic, comprehensive evaluation of 12 mainstream data repair algorithms across various data error rates, error types, and downstream analysis tasks, assessing their error reduction performance with a novel metric. We also develop an effective and unified repair optimization strategy that, as empirically confirmed, substantially benefits state-of-the-art methods. We demonstrate that purely clean data may not necessarily yield the best performance in data analysis tasks, and that data is always worth repairing regardless of error rate. Based on these observations and insights, we provide practical guidelines for 5 scenarios and 2 main data analysis tasks. We anticipate that this paper will enable researchers and users to better understand and deploy data repair algorithms in practice. Finally, we outline research challenges and promising future directions in the data repair field.