Executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. In current practice, however, such workflows are built almost entirely through manual engineering: developers must carefully design each workflow, write prompts for every step, and repeatedly revise the logic as requirements evolve, which makes development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic baseline to improve performance. The benchmark is built from a large collection of real-world business workflows, and each instance is designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex and evolving requirements. Although our agentic baseline yields resolve-rate gains of up to 6.05%, the gap that remains relative to real-world needs positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.