Data preparation, and data cleaning in particular, is essential for ensuring data quality and improving the output of automated decision systems. Since no single tool covers all required steps, a combination of tools, namely a data preparation pipeline, is needed. Building such a pipeline comes with a number of challenges. We outline these challenges and describe the tasks we plan to analyze in future research to address them. We also introduce in detail a test data generator that we implemented as the basis for this future work.