Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management Tasks

Michael Wornow, Avanika Narayan, Ben Viggiano, Ishan S. Khare, Tathagat Verma, Tibor Thompson, Miguel Angel Fuentes Hernandez, Sudharsan Sundar, Chloe Trujillo, Krrish Chawla, Rongfei Lu, Justin Shen, Divya Nagaraj, Joshua Martinez, Vardhan Agrawal, Althea Hudson, Nigam H. Shah, Christopher Re

Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating models on business process management (BPM) tasks. BPM is the practice of documenting, measuring, improving, and automating enterprise workflows. However, research has focused almost exclusively on one task: full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4. This focus on automation ignores how most BPM tools are applied today: simply documenting the relevant workflow takes 60% of the time of the typical process optimization project. To address this gap, we present WONDERBREAD, the first benchmark for evaluating multimodal FMs on BPM tasks beyond automation. Our contributions are: (1) a dataset containing 2928 documented workflow demonstrations; (2) 6 novel BPM tasks sourced from real-world applications, ranging from workflow documentation to knowledge transfer to process improvement; and (3) an automated evaluation harness. Our benchmark shows that while state-of-the-art FMs can automatically generate documentation (e.g., recalling 88% of the steps taken in a video demonstration of a workflow), they struggle to re-apply that knowledge towards finer-grained validation of workflow completion (F1 < 0.3). We hope WONDERBREAD encourages the development of more "human-centered" AI tooling for enterprise applications and furthers the exploration of multimodal FMs for the broader universe of BPM tasks. We publish our dataset and experiments here: https://github.com/HazyResearch/wonderbread
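
To make the two headline metrics concrete, the following is a minimal sketch of how step recall and binary F1 could be computed. It is an illustration under stated assumptions, not the benchmark's evaluation harness: the function names `normalize`, `step_recall`, and `binary_f1` are hypothetical, and steps are matched here by normalized exact string equality, which is a simplification chosen only to keep the example self-contained.

```python
from typing import Sequence


def normalize(step: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(step.lower().split())


def step_recall(reference_steps: Sequence[str], generated_steps: Sequence[str]) -> float:
    """Fraction of reference steps that also appear in the generated documentation.

    Assumption: a step counts as recalled only on a normalized exact match.
    """
    if not reference_steps:
        return 0.0
    generated = {normalize(s) for s in generated_steps}
    hits = sum(1 for s in reference_steps if normalize(s) in generated)
    return hits / len(reference_steps)


def binary_f1(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """F1 for a binary task, e.g. 'was this workflow completed correctly?'."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical workflow steps, invented purely for illustration.
    reference = ["Click Login", "Enter password", "Press Submit", "Open settings"]
    generated = ["click  login", "press submit", "enter password"]
    print(step_recall(reference, generated))
    print(binary_f1([True, True, False, False], [True, False, True, False]))
```

Under this reading, a recall of 88% means that roughly 88 of every 100 demonstrated steps appear in the generated documentation, while an F1 below 0.3 on validation indicates that the model's completion judgments have low precision, low recall, or both.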
