Machine learning models are routinely integrated into process mining pipelines to carry out tasks such as data transformation, noise reduction, anomaly detection, classification, and prediction. The design of such models often rests on ad hoc assumptions about the underlying data distributions, which do not necessarily agree with the non-parametric distributions typically observed in process data. Moreover, the learning procedures they follow ignore the constraints that concurrency imposes on process data. Data encoding is a key element in reducing the mismatch between these assumptions and the actual characteristics of process data, but its potential is poorly exploited. In this paper, we argue that deeper insight into the issues raised by training machine learning models on process data is crucial to ground a sound integration of process mining and machine learning. Our analysis of these issues is intended to lay the foundation for a methodology that correctly aligns machine learning with process mining requirements, and to stimulate further research in this direction.