Multi-agent LLM systems that generate structured workflows from natural-language requests are now deployed in production across cloud automation, DevOps, and enterprise process orchestration. Operating such systems exposes a recurring change-management problem. Routine updates, such as re-running the same input, swapping the underlying LLM, or refactoring an agent's prompt or orchestration code, frequently produce workflows that differ substantially from previously validated references. Engineers are then left without a principled way to decide whether a change is safe to ship. Automatic workflow evaluation is the natural tool for answering this question. In practice, however, metric scores are poorly calibrated, and a numeric change rarely communicates the severity of the underlying degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics by applying realistic, graded perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores in change-management settings. Our dataset will be released upon acceptance.
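To make the severity-aware reading concrete, the following is a minimal sketch of a Missing Steps perturbation at the three severity levels, together with a residual against an expected score. It is not the WorkflowPerturb implementation. The names `drop_steps`, `step_recall`, and `residuals` are hypothetical, the metric is a toy exact-match recall, and the linear expectation `1 - severity` is an assumption made only for illustration.

```python
import random

# Severity levels named in the abstract.
SEVERITIES = (0.10, 0.30, 0.50)


def drop_steps(steps, severity, rng):
    """Missing Steps perturbation (illustrative): remove about
    severity * len(steps) steps, and always at least one."""
    n_drop = max(1, round(severity * len(steps)))
    dropped = set(rng.sample(range(len(steps)), n_drop))
    return [s for i, s in enumerate(steps) if i not in dropped]


def step_recall(golden, candidate):
    """Toy metric: fraction of golden steps found verbatim in the candidate.
    Assumes step descriptions are distinct strings."""
    kept = set(candidate)
    return sum(s in kept for s in golden) / len(golden)


def residuals(golden, seed=0):
    """Observed minus expected score at each severity level.
    The expected score 1 - severity is an assumed linear trajectory."""
    rng = random.Random(seed)
    out = {}
    for severity in SEVERITIES:
        observed = step_recall(golden, drop_steps(golden, severity, rng))
        expected = 1.0 - severity
        out[severity] = observed - expected
    return out


if __name__ == "__main__":
    golden = [f"step {i}" for i in range(10)]
    for severity, residual in residuals(golden).items():
        print(f"severity={severity:.2f} residual={residual:+.3f}")
```

Under these assumptions the toy metric tracks the expectation on a ten-step workflow, so its residuals are zero. A metric that moves less than the applied severity yields positive residuals, and one that overreacts yields negative residuals.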