Beyond Rule-based Named Entity Recognition and Relation Extraction for Process Model Generation from Natural Language Text

Process-aware information systems offer extensive advantages to companies, facilitating planning, operations, and optimization of day-to-day business activities. However, the time-consuming but required step of designing formal business process models often hampers the potential of these systems. To overcome this challenge, automated generation of business process models from natural language text has emerged as a promising approach to expedite this step. Generally, two crucial subtasks have to be solved: extracting process-relevant information from natural language and creating the actual model. Approaches to the first subtask are rule-based methods, which are highly optimized for specific domains but hard to adapt to related applications. To solve this issue, we present an extension to an existing pipeline that makes it entirely data-driven. We demonstrate the competitiveness of our improved pipeline, which not only eliminates the substantial overhead associated with feature engineering and rule definition, but also enables adaptation to different datasets, entity and relation types, and new domains. Additionally, the largest available dataset for the first subtask (PET) contains no information about linguistic references between mentions of entities in the process description. Yet resolving these mentions into a single visual element is essential for high-quality process models. We propose an extension to the PET dataset that incorporates information about linguistic references, along with a corresponding method for resolving them. Finally, we provide a detailed analysis of the inherent challenges in the dataset at hand.
