Data catalogs play a crucial role in modern data-driven organizations by facilitating the discovery, understanding, and use of diverse data assets. However, ensuring their quality and reliability is difficult, especially in open and large-scale data environments. This paper proposes a framework that automatically assesses the quality of open data catalogs, addressing the need for efficient and reliable quality assessment mechanisms. The framework analyzes core quality dimensions such as accuracy, completeness, consistency, scalability, and timeliness; offers several alternatives for assessing compatibility and similarity across catalogs; and implements a set of non-core quality dimensions such as provenance, readability, and licensing. The goal is to empower data-driven organizations to make informed decisions based on trustworthy and well-curated data assets. The source code illustrating our approach is available at https://www.github.com/jorge-martinez-gil/dataq/.