Process Extraction from Text: Benchmarking the State of the Art and Paving the Way for Future Challenges

The extraction of process models from text refers to the problem of turning the information contained in an unstructured textual process description into a formal representation, i.e., a process model. Several automated approaches have been proposed to tackle this problem, but they are highly heterogeneous in scope and underlying assumptions, i.e., they differ in input, target output, and the data used in their evaluation. As a result, it is currently unclear how well existing solutions solve the model-extraction problem and how they compare to each other. We address this issue by systematically comparing 10 state-of-the-art approaches for model extraction, covering both qualitative and quantitative aspects. The qualitative evaluation compares the primary studies on: (1) the main characteristics of each solution; (2) the type of process model elements extracted from the input data; (3) the experimental evaluation performed to assess the proposed framework. The results show a heterogeneity of techniques, extracted elements, and evaluations that often makes the approaches impossible to compare. To overcome this difficulty, we propose a quantitative comparison of the tools presented in these papers on the unifying task of process model entity and relation extraction, so that they can be compared directly. The results show three distinct groups of tools in terms of performance; no tool obtains very good scores, and serious limitations are evident. Moreover, the proposed evaluation pipeline can serve as a reference task, with a well-defined dataset and metrics, against which new tools can be compared. The paper also reflects, based on the results of the qualitative and quantitative evaluations, on the limitations and challenges that the community needs to address in the future to produce significant advances in this area.
