Data catalogs play a crucial role in modern data-driven organizations by facilitating the discovery, understanding, and use of diverse data assets. However, ensuring their quality and reliability is difficult, especially in open and large-scale data environments. This paper proposes a framework that automatically assesses the quality of open data catalogs, addressing the need for efficient and reliable quality assessment mechanisms. The framework analyzes core quality dimensions such as accuracy, completeness, consistency, scalability, and timeliness; offers several alternatives for assessing compatibility and similarity across catalogs; and implements a set of non-core quality dimensions such as provenance, readability, and licensing. The goal is to empower data-driven organizations to make informed decisions based on trustworthy and well-curated data assets. The source code that illustrates our approach can be downloaded from https://www.github.com/jorge-martinez-gil/dataq/.