The creation of Business Process Model and Notation (BPMN) models is a complex and time-consuming task requiring both domain knowledge and proficiency in modeling conventions. Recent advances in large language models (LLMs) have significantly expanded the possibilities for generating BPMN models directly from natural language, building on earlier text-to-process methods with enhanced capabilities for handling complex descriptions. However, systematic evaluations of LLM-generated process models are lacking: current efforts either rely on LLM-as-a-judge approaches or do not consider established dimensions of model quality. To this end, we introduce BEF4LLM, a novel LLM evaluation framework comprising four perspectives: syntactic quality, pragmatic quality, semantic quality, and validity. Using BEF4LLM, we conduct a comprehensive analysis of open-source LLMs and benchmark their performance against human modeling experts. Results indicate that LLMs excel in syntactic and pragmatic quality, while humans outperform them in semantic quality; however, the differences in scores are relatively modest, highlighting LLMs' competitive potential despite challenges in validity and semantic quality. These insights reveal the current strengths and limitations of using LLMs for BPMN modeling and guide future model development and fine-tuning. Addressing these areas is essential for advancing the practical deployment of LLMs in business process modeling.