Semantic Modelling of Organizational Knowledge as a Basis for Enterprise Data Governance 4.0 -- Application to a Unified Clinical Data Model

Miguel AP Oliveira, Stephane Manara, Bruno Molé, Thomas Muller, Aurélien Guillouche, Lysann Hesske, Bruce Jordan, Gilles Hubert, Chinmay Kulkarni, Pralipta Jagdev, Cedric R. Berger

Individuals and organizations cope with an ever-growing amount of data that is heterogeneous in content and format. An adequate data management process that yields data quality and control over the data lifecycle is a prerequisite for getting value out of this data and minimizing the risks inherent in its multiple usages. Common data governance frameworks rely on people, policies, and processes, which fall short of the overwhelming complexity of data. Yet harnessing this complexity is necessary to achieve high quality standards, which condition the outcome of any downstream data usage, including generative artificial intelligence trained on this data. In this paper, we report our concrete experience establishing a simple, cost-efficient framework that enables metadata-driven, agile, and (semi-)automated data governance (i.e., Data Governance 4.0). We explain how we implement and use this framework to integrate 25 years of clinical study data at enterprise scale in a fully productive environment. The framework encompasses both methodologies and technologies that leverage semantic web principles. We built a knowledge graph describing avatars of data assets in their business context, including governance principles. Multiple ontologies, articulated by an enterprise upper ontology, enable key governance actions such as FAIRification, lifecycle management, definition of roles and responsibilities, lineage across transformations, and provenance from source systems. This metadata model is the keystone of Data Governance 4.0: a semi-automated data management process that takes the business context into account in an agile manner, adapting governance constraints to each use case and tuning them dynamically as the business changes.
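To make the idea of a metadata knowledge graph concrete, the following is a minimal sketch, not the authors' implementation. It assumes data assets are described as subject-predicate-object triples and that lineage and provenance are answered by walking those triples. Every identifier in it (the `gov:` predicates, the asset names, `sys:source_a`, and the helper functions) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical vocabulary and assets; none of these names come from the paper.
TRIPLES = [
    ("asset:raw_visits", "rdf:type", "gov:DataAsset"),
    ("asset:raw_visits", "gov:sourceSystem", "sys:source_a"),
    ("asset:raw_visits", "gov:owner", "role:data_steward"),
    ("asset:harmonised_visits", "rdf:type", "gov:DataAsset"),
    ("asset:harmonised_visits", "gov:derivedFrom", "asset:raw_visits"),
    ("asset:study_report", "rdf:type", "gov:DataAsset"),
    ("asset:study_report", "gov:derivedFrom", "asset:harmonised_visits"),
]


def build_index(triples):
    """Index triples as subject -> predicate -> list of objects."""
    index = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        index[subject][predicate].append(obj)
    return index


def lineage(index, asset):
    """Return upstream assets, nearest first, by following gov:derivedFrom."""
    chain, seen, frontier = [], {asset}, [asset]
    while frontier:
        current = frontier.pop(0)
        for parent in index[current]["gov:derivedFrom"]:
            if parent not in seen:
                seen.add(parent)
                chain.append(parent)
                frontier.append(parent)
    return chain


def provenance(index, asset):
    """Return the source systems of the asset and of everything upstream."""
    systems = []
    for node in [asset] + lineage(index, asset):
        for system in index[node]["gov:sourceSystem"]:
            if system not in systems:
                systems.append(system)
    return systems


index = build_index(TRIPLES)
print(lineage(index, "asset:study_report"))
print(provenance(index, "asset:study_report"))
```

In a production setting the triples would more plausibly live in an RDF store and be queried with SPARQL against the ontologies the abstract mentions; the plain-Python index here only illustrates how lineage and provenance questions reduce to graph traversal over metadata.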

