Individuals and organizations must cope with an ever-growing amount of data that is heterogeneous in content and format. Extracting value from this data while minimising the risks inherent in its multiple uses requires an adequate data management process that ensures data quality and control over the data lifecycle. Common data governance frameworks, which rely on people, policies and processes, fall short in the face of this overwhelming complexity. Yet harnessing this complexity is necessary to achieve high quality standards. The latter condition the outcome of any downstream data usage, including generative artificial intelligence trained on this data. In this paper, we report our concrete experience establishing a simple, cost-efficient framework that enables metadata-driven, agile and (semi-)automated data governance (i.e. Data Governance 4.0). We explain how we implement and use this framework to integrate 25 years of clinical study data at enterprise scale, in a full production environment. The framework encompasses both methodologies and technologies that leverage semantic web principles. We built a knowledge graph describing avatars of data assets in their business context, including governance principles. Multiple ontologies, articulated by an enterprise upper ontology, enable key governance actions such as FAIRification, lifecycle management, definition of roles and responsibilities, lineage across transformations and provenance from source systems. This metadata model is the keystone of Data Governance 4.0: a semi-automated data management process that takes the business context into account in an agile manner, adapting governance constraints to each use case and dynamically tuning them as the business changes.