Data-oriented applications, their users, and even the law require data of high quality. Research has divided the rather vague notion of data quality into various dimensions, such as accuracy, consistency, and reputation. To achieve the goal of high data quality, many tools and techniques exist to clean and otherwise improve data. Yet, systematic research on actually assessing data quality in its dimensions is largely absent, and with it, the ability to gauge the success of any data cleaning effort. We propose five facets as ingredients to assess data quality: data, source, system, task, and human. Tapping each facet for data quality assessment poses its own challenges. We show how overcoming these challenges helps data quality assessment for those data quality dimensions mentioned in Europe's AI Act. Our work concludes with a proposal for a comprehensive data quality assessment framework.