An increasing number of organizations are deploying Large Language Models (LLMs) for a wide range of tasks. Despite their general utility, LLMs are prone to errors, ranging from inaccuracies to hallucinations. To assess the capabilities of existing LLMs objectively, performance benchmarks have been developed. However, results on these benchmarks often do not translate to more specific real-world tasks. This paper addresses the gap in benchmarking LLM performance in the Business Process Management (BPM) domain. Currently, no BPM-specific benchmarks exist, creating uncertainty about the suitability of different LLMs for BPM tasks. This paper systematically compares LLM performance on four BPM tasks, focusing on small open-source models. The analysis aims to identify task-specific performance variations, compare the effectiveness of open-source versus commercial models, and assess the impact of model size on BPM task performance. The paper thereby provides insights into the practical application of LLMs in BPM, guiding organizations in selecting appropriate models for their specific needs.