Object-centric process mining is emerging as a promising paradigm across diverse industries and is drawing substantial academic attention. Existing object-centric data formats, however, are designed primarily for exchanging static event logs between data owners, researchers, and analysts, not as a foundational data model for continuous data ingestion and transformation pipelines that feed subsequent storage and analysis. This focus results in suboptimal design choices in terms of flexibility, scalability, and maintainability. For example, current object-centric event log formats struggle to accommodate novel object types or new attributes that appear in streaming data. This paper proposes a database format designed for an intermediate data storage hub that decouples process mining applications from their data sources in a hub-and-spoke architecture. It delineates the essential requirements for robust object-centric event log storage from a data engineering perspective and introduces a novel relational schema tailored to these requirements. To validate the proposed database format, an end-to-end solution is implemented using a lightweight, open-source data stack. The implementation includes data extractors for various object-centric event log formats, automated data quality assessments, and intuitive process data visualization capabilities.
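To make the streaming concern above concrete, the following is a minimal sketch of how a relational layout can absorb previously unseen object types and attributes without schema changes. It is not the schema proposed in the paper: every table name, column name, and the `connect` and `ingest` helpers are hypothetical and chosen only for illustration, and SQLite stands in for whatever database the actual data stack uses. The assumption is that object types and attribute values are stored as rows rather than as columns, so a new type or attribute arriving in the stream becomes an `INSERT` rather than an `ALTER TABLE`.

```python
import sqlite3

# Hypothetical schema (not the paper's): types and attributes are rows, not columns.
SCHEMA = """
CREATE TABLE object_type (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE object (
    id             TEXT PRIMARY KEY,
    object_type_id INTEGER NOT NULL REFERENCES object_type(id)
);
CREATE TABLE event (
    id        TEXT PRIMARY KEY,
    activity  TEXT NOT NULL,
    timestamp TEXT NOT NULL
);
CREATE TABLE event_object (
    event_id  TEXT NOT NULL REFERENCES event(id),
    object_id TEXT NOT NULL REFERENCES object(id),
    PRIMARY KEY (event_id, object_id)
);
CREATE TABLE object_attribute_value (
    object_id  TEXT NOT NULL REFERENCES object(id),
    name       TEXT NOT NULL,
    value      TEXT,
    valid_from TEXT NOT NULL,
    PRIMARY KEY (object_id, name, valid_from)
);
"""


def connect():
    """Create an in-memory database with the illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def ingest(conn, event):
    """Store one event; unseen object types and attributes simply become new rows."""
    conn.execute(
        "INSERT INTO event (id, activity, timestamp) VALUES (?, ?, ?)",
        (event["id"], event["activity"], event["timestamp"]),
    )
    for obj in event["objects"]:
        # Register the object type on first sight; no DDL is needed.
        conn.execute(
            "INSERT OR IGNORE INTO object_type (name) VALUES (?)", (obj["type"],)
        )
        (type_id,) = conn.execute(
            "SELECT id FROM object_type WHERE name = ?", (obj["type"],)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO object (id, object_type_id) VALUES (?, ?)",
            (obj["id"], type_id),
        )
        conn.execute(
            "INSERT OR IGNORE INTO event_object (event_id, object_id) VALUES (?, ?)",
            (event["id"], obj["id"]),
        )
        # Attribute values are stored as text, stamped with the event time.
        for name, value in obj.get("attributes", {}).items():
            conn.execute(
                "INSERT OR REPLACE INTO object_attribute_value "
                "(object_id, name, value, valid_from) VALUES (?, ?, ?, ?)",
                (obj["id"], name, str(value), event["timestamp"]),
            )
    conn.commit()


def demo():
    """Ingest two events; the second introduces a new object type and attribute."""
    conn = connect()
    ingest(conn, {
        "id": "e1", "activity": "place order", "timestamp": "2024-01-01T09:00:00",
        "objects": [{"type": "order", "id": "o1", "attributes": {"amount": 100}}],
    })
    ingest(conn, {
        "id": "e2", "activity": "ship parcel", "timestamp": "2024-01-02T10:00:00",
        "objects": [
            {"type": "order", "id": "o1"},
            {"type": "parcel", "id": "p1", "attributes": {"weight_kg": 2.5}},
        ],
    })
    return conn
```

In this sketch the second event introduces the `parcel` type and the `weight_kg` attribute without any change to the table definitions. The trade-off of such a row-based attribute layout is that values lose their native types and queries need extra joins or pivots; a production design would have to weigh that against the flexibility gained.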