Data catalogs play a crucial role in modern data-driven organizations by facilitating the discovery, understanding, and use of diverse data assets. However, ensuring their quality and reliability is difficult, especially in open and large-scale data environments. This paper proposes a framework for automatically assessing the quality of open data catalogs, addressing the need for efficient and reliable quality assessment mechanisms. The framework analyzes core quality dimensions, such as accuracy, completeness, consistency, scalability, and timeliness; offers several alternatives for assessing compatibility and similarity across catalogs; and implements a set of non-core quality dimensions, such as provenance, readability, and licensing. The goal is to enable data-driven organizations to make informed decisions based on trustworthy and well-curated data assets. The source code that illustrates our approach is available at https://www.github.com/jorge-martinez-gil/dataq/.
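To make the notion of a quality dimension concrete, the following minimal sketch shows one way completeness and timeliness could be scored for a small catalog. It is an illustration only and is not taken from the dataq repository: the required field names, the linear decay used for timeliness, the one-year threshold, and the function names are all assumptions introduced here.

```python
from datetime import date

# Hypothetical required metadata fields; the abstract does not specify which fields are checked.
REQUIRED_FIELDS = ("title", "description", "license", "modified")


def completeness(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if str(record.get(f) or "").strip())
    return filled / len(REQUIRED_FIELDS)


def timeliness(record: dict, today: date, max_age_days: int = 365) -> float:
    """Assumed linear decay: 1.0 if modified today, 0.0 at max_age_days or older."""
    modified = record.get("modified")
    if not modified:
        return 0.0
    age = (today - date.fromisoformat(modified)).days
    return max(0.0, 1.0 - max(age, 0) / max_age_days)


def catalog_score(records: list[dict], today: date) -> dict:
    """Average each dimension over all records in the catalog."""
    n = len(records)
    if n == 0:
        return {"completeness": 0.0, "timeliness": 0.0}
    return {
        "completeness": sum(completeness(r) for r in records) / n,
        "timeliness": sum(timeliness(r, today) for r in records) / n,
    }


# Toy catalog with one fully described, fresh record and one sparse, stale record.
catalog = [
    {"title": "Air quality", "description": "Hourly readings",
     "license": "CC-BY-4.0", "modified": "2024-01-01"},
    {"title": "Bus stops", "description": "", "license": None,
     "modified": "2023-01-01"},
]
scores = catalog_score(catalog, today=date(2024, 1, 1))
print(scores)
```

Averaging per-record scores is only one possible aggregation; a real assessment might weight datasets by usage or report the distribution rather than a single mean.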