Learning optimal policies from event logs through reinforcement learning: a comparison of deep and MDP-based approaches

Prescriptive Process Monitoring is an emerging area within Process Mining that focuses on recommending actions to optimize business outcomes. Most existing work prescribes pre-defined interventions, i.e., sets of actions applied to ongoing process executions to achieve a specific objective or improve a Key Performance Indicator (KPI). In contrast, only a few approaches have explored learning and evaluating optimal behavioral policies, i.e., general strategies that determine the best sequence of actions to maximize a desired KPI. In this paper, we address the problem of learning optimal behavioral policies: we propose an AI-based approach that uses Reinforcement Learning (RL) to learn a policy directly from historical process executions and to recommend the best actions for optimizing a KPI. To this end, we employ two RL techniques. The first is a classical model-based approach that extends previous work by the authors through the construction of a Markov Decision Process (MDP) capturing process behavior. The second is a model-free technique based on offline Deep RL. Unlike state-of-the-art work, we aim to minimize the use of domain knowledge and learn optimal policies directly from historical event data. This allows us to learn when to apply interventions and to discover effective ones directly from data. Moreover, we target complex scenarios involving external actors, where the process owner controls only part of the activities. We adopt a data-driven Business Process Simulation (BPS) environment to evaluate the learned policies. Results show that both methods improve the targeted KPI with similar effectiveness, while the model-based approach outperforms offline Deep RL in computational efficiency.
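
To make the model-based idea concrete, the following is a minimal sketch of how an MDP could be estimated from logged transitions and then solved with value iteration. It is an illustration under stated assumptions, not the authors' implementation: the toy log, the state abstraction (a single process state per step), the reward (1 for an accepted case, 0 otherwise), and the helper names `estimate_mdp` and `value_iteration` are all hypothetical. The stochastic next state stands in for the response of an external actor that the process owner does not control.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Estimate P(s'|s,a) and the mean reward R(s,a) by counting logged transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
    P, R = {}, {}
    for sa, nexts in counts.items():
        total = sum(nexts.values())
        P[sa] = {s2: n / total for s2, n in nexts.items()}
        R[sa] = reward_sum[sa] / total
    return P, R

def q_value(P, R, V, s, a, gamma):
    """Expected return of taking action a in state s under the estimated model."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Solve the estimated MDP; states without logged actions are treated as terminal."""
    states = {s for s, _ in P} | {s2 for d in P.values() for s2 in d}
    actions = defaultdict(list)
    for s, a in P:
        actions[s].append(a)
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions[s]:
                continue  # terminal state, value stays 0
            best = max(q_value(P, R, V, s, a, gamma) for a in actions[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {
        s: max(acts, key=lambda a: q_value(P, R, V, s, a, gamma))
        for s, acts in actions.items() if acts
    }
    return V, policy

# Hypothetical toy log: (state, owner action, reward, next state).
log = (
    [("received", "call", 1.0, "accepted")] * 3
    + [("received", "call", 0.0, "declined")] * 1
    + [("received", "email", 1.0, "accepted")] * 1
    + [("received", "email", 0.0, "declined")] * 3
)

P, R = estimate_mdp(log)
V, policy = value_iteration(P, R)
print(policy)
```

In this toy log, calling the customer leads to acceptance in three of four cases and emailing in one of four, so the learned policy recommends `call` in state `received`. A realistic setting would need a richer state abstraction over the trace prefix and a KPI-derived reward.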
