Individuals and organizations cope with an ever-growing amount of data that is heterogeneous in content and format. An adequate data management process that ensures data quality and control over the data lifecycle is a prerequisite for extracting value from this data and for minimizing the risks inherent in its multiple uses. Common data governance frameworks rely on people, policies, and processes, which fall short in the face of the overwhelming complexity of data. Yet harnessing this complexity is necessary to achieve high quality standards, which in turn condition the outcome of any downstream data usage, including generative artificial intelligence trained on this data. In this paper, we report our concrete experience in establishing a simple, cost-efficient framework that enables metadata-driven, agile, and (semi-)automated data governance (i.e., Data Governance 4.0). We explain how we implemented and use this framework to integrate 25 years of clinical study data at enterprise scale in a fully productive environment. The framework encompasses both methodologies and technologies that leverage semantic web principles. We built a knowledge graph describing avatars of data assets in their business context, including governance principles. Multiple ontologies, articulated by an enterprise upper ontology, enable key governance actions such as FAIRification, lifecycle management, definition of roles and responsibilities, lineage across transformations, and provenance from source systems. This metadata model is the keystone of Data Governance 4.0: a semi-automated data management process that takes the business context into account in an agile manner, adapting governance constraints to each use case and dynamically tuning them as the business changes.
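To make the idea of a governance knowledge graph more tangible, the following is a minimal sketch in plain Python, not the framework reported here. It stores metadata about data assets as subject-predicate-object triples, walks lineage back to a source system, and runs one automated governance check. Every identifier in it (the `asset:`, `gov:`, and `role:` names, and the functions `build_index`, `lineage`, and `missing_steward`) is a hypothetical illustration; `prov:wasDerivedFrom` borrows the W3C PROV-O property name purely as a string label.

```python
# All identifiers below are hypothetical illustrations, not taken from the paper.
# "prov:wasDerivedFrom" reuses the W3C PROV-O property name as a plain string.
TRIPLES = [
    ("asset:clinical_db_export", "rdf:type", "gov:SourceSystemExtract"),
    ("asset:study_raw", "rdf:type", "gov:DataAsset"),
    ("asset:study_raw", "prov:wasDerivedFrom", "asset:clinical_db_export"),
    ("asset:study_raw", "gov:hasSteward", "role:study_data_steward"),
    ("asset:study_harmonized", "rdf:type", "gov:DataAsset"),
    ("asset:study_harmonized", "prov:wasDerivedFrom", "asset:study_raw"),
    ("asset:study_harmonized", "gov:lifecycleStage", "gov:Published"),
]


def build_index(triples):
    """Index triples as subject -> predicate -> list of objects."""
    index = {}
    for subject, predicate, obj in triples:
        index.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return index


def lineage(index, asset):
    """Return upstream assets reachable via prov:wasDerivedFrom, nearest first."""
    upstream, queue, seen = [], [asset], {asset}
    while queue:
        current = queue.pop(0)
        for parent in index.get(current, {}).get("prov:wasDerivedFrom", []):
            if parent not in seen:
                seen.add(parent)
                upstream.append(parent)
                queue.append(parent)
    return upstream


def missing_steward(index):
    """List data assets without a steward: one simple automated governance check."""
    return sorted(
        subject
        for subject, predicates in index.items()
        if "gov:DataAsset" in predicates.get("rdf:type", [])
        and not predicates.get("gov:hasSteward")
    )


if __name__ == "__main__":
    index = build_index(TRIPLES)
    print(lineage(index, "asset:study_harmonized"))
    print(missing_steward(index))
```

In a setting like the one described, an RDF triple store queried with SPARQL would presumably take the place of the in-memory list, and the governance vocabulary would come from the enterprise ontologies rather than ad hoc strings. The sketch only shows why metadata expressed as a graph lends itself to automated lineage and responsibility checks.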