LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult because metric scores are often poorly calibrated, and score changes do not directly communicate the severity of workflow degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics, constructed by applying realistic, controlled perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants spanning three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores. Our dataset will be released upon acceptance.
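To make the perturbation scheme concrete, here is a minimal sketch of how one perturbation type could be applied at a given severity level. This is an illustrative assumption, not the authors' released code: the function name, the list-of-strings workflow representation, and the rounding choice are all hypothetical.

```python
import random

def perturb_missing_steps(workflow, severity, seed=0):
    """Illustrative sketch (not the benchmark's actual code): drop a
    `severity` fraction of steps from a golden workflow, mimicking the
    Missing Steps perturbation at severity levels 0.1, 0.3, or 0.5."""
    rng = random.Random(seed)
    n_remove = max(1, round(severity * len(workflow)))
    drop = set(rng.sample(range(len(workflow)), n_remove))
    # Keep the surviving steps in their original order.
    return [step for i, step in enumerate(workflow) if i not in drop]

# Hypothetical 5-step golden workflow.
golden = ["collect data", "clean data", "train model", "evaluate", "deploy"]
perturbed = perturb_missing_steps(golden, severity=0.3)
```

A metric that is well calibrated in the paper's sense would score `perturbed` lower than `golden` by an amount that tracks the severity level.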