Data catalogs play a crucial role in modern data-driven organizations by facilitating the discovery, understanding, and utilization of diverse data assets. However, ensuring their quality and reliability is complex, especially in open and large-scale data environments. This paper proposes a framework to automatically determine the quality of open data catalogs, addressing the need for efficient and reliable quality assessment mechanisms. Our framework analyzes core quality dimensions such as accuracy, completeness, consistency, scalability, and timeliness; offers several alternatives for assessing compatibility and similarity across such catalogs; and implements a set of non-core quality dimensions such as provenance, readability, and licensing. The goal is to empower data-driven organizations to make informed decisions based on trustworthy and well-curated data assets. The source code that illustrates our approach can be downloaded from https://www.github.com/jorge-martinez-gil/dataq/.