Recent advancements in large language models (LLMs) have driven a revolutionary paradigm shift in process automation from Robotic Process Automation to Agentic Process Automation by automating the workflow orchestration procedure based on LLMs. However, existing LLMs (even the advanced OpenAI GPT-4o) still fall short of satisfactory capability in workflow orchestration. To address this limitation, we present WorkflowLLM, a data-centric framework carefully designed to enhance the capability of LLMs in workflow orchestration. It first constructs a large-scale fine-tuning dataset, WorkflowBench, with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories. Specifically, the construction process can be divided into three phases: (1) Data Collection: we collect real-world workflow data from Apple Shortcuts and RoutineHub, transcribing them into Python-style code. We further augment them with hierarchical thought generated by ChatGPT. (2) Query Expansion: we prompt ChatGPT to generate more task queries to enrich the diversity and complexity of workflows. (3) Workflow Generation: we leverage an annotator model trained on the collected data to generate workflows for the synthesized queries. Finally, we merge the synthetic samples that pass quality confirmation with the collected samples to obtain WorkflowBench. Based on WorkflowBench, we fine-tune Llama-3.1-8B to obtain WorkflowLlama. Our experiments show that WorkflowLlama demonstrates a strong capacity to orchestrate complex workflows, while also achieving notable generalization performance on previously unseen APIs. Additionally, WorkflowLlama exhibits robust zero-shot generalization capabilities on an out-of-distribution task planning dataset, T-Eval. Our data and code are available at https://github.com/OpenBMB/WorkflowLLM.
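To make the "Python-style code with hierarchical thought" representation concrete, here is a minimal illustrative sketch (not taken from the paper): a hypothetical Shortcuts-like workflow transcribed into Python, where stub functions stand in for application APIs and the plan comment plays the role of the ChatGPT-generated hierarchical thought. All function names below are assumptions for illustration only.

```python
# Illustrative sketch only: hypothetical stub APIs standing in for
# the Python-style transcription of an Apple Shortcuts workflow.

def get_clipboard():
    """Stub for a 'Get Clipboard' action; returns a URL string."""
    return "https://example.com/article"

def fetch_webpage(url):
    """Stub for a 'Get Contents of URL' action."""
    return f"<html>content of {url}</html>"

def save_note(text):
    """Stub for a 'Create Note' action; returns the saved text."""
    return text

def save_clipboard_page_to_notes():
    # Hierarchical thought (plan generated before the code):
    # 1. Read the URL from the clipboard.
    # 2. Download the page it points to.
    # 3. Store the page contents as a note.
    url = get_clipboard()
    page = fetch_webpage(url)
    return save_note(page)
```

In WorkflowBench, such transcriptions are paired with a natural-language task query (e.g., "save the page on my clipboard to my notes"), so fine-tuning teaches the model to map queries to orchestrated API call sequences.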