Data catalogs play a crucial role in modern data-driven organizations by facilitating the discovery, understanding, and use of diverse data assets. However, ensuring their quality and reliability is difficult, especially in open and large-scale data environments. This paper proposes a framework for automatically assessing the quality of open data catalogs, addressing the need for efficient and reliable quality assessment mechanisms. The framework analyzes several core quality dimensions, such as accuracy, completeness, consistency, scalability, and timeliness; offers several alternatives for assessing compatibility and similarity across such catalogs; and implements a set of non-core quality dimensions, such as provenance, readability, and licensing. The goal is to empower data-driven organizations to make informed decisions based on trustworthy and well-curated data assets. The source code that illustrates our approach can be downloaded from https://www.github.com/jorge-martinez-gil/dataq/.