Optimizing LLM-based agentic workflows is a central challenge in scaling AI capabilities. Current methods rely on coarse, end-to-end evaluation signals and provide little fine-grained guidance on where to refine, often resulting in inefficient or low-impact modifications. To address these limitations, we propose {\our{}}, an Evaluation-Judge-Optimization-Update pipeline. We incorporate reusable, configurable logic blocks into agentic workflows to capture fundamental forms of logic. On top of this abstraction, we design a dedicated Judge module that inspects execution traces, particularly failed runs, and assigns rank-based responsibility scores to problematic blocks. An LLM-based optimizer then leverages these fine-grained diagnostic signals to focus modifications on the most problematic block in the workflow. Our approach improves sample efficiency, enhances interpretability through block-level diagnostics, and provides a scalable foundation for automating increasingly complex agentic workflows. We evaluate {\our{}} on mathematical reasoning and code generation benchmarks, where it achieves superior performance and efficiency compared to existing methods. The source code is publicly available at https://github.com/ma-zihan/JudgeFlow.
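To make the pipeline concrete, the sketch below outlines one Evaluation-Judge-Optimization-Update iteration in Python. All names (\texttt{optimize\_workflow}, \texttt{rank\_blocks}, \texttt{revise}, and so on) are illustrative assumptions rather than the released {\our{}} API; see the repository above for the actual implementation.

\begin{verbatim}
# Minimal sketch of one Evaluation-Judge-Optimization-Update iteration.
# All function and attribute names are illustrative assumptions, not the
# released API (https://github.com/ma-zihan/JudgeFlow).
def optimize_workflow(workflow, tasks, judge_llm, optimizer_llm, num_rounds=5):
    for _ in range(num_rounds):
        # Evaluation: run the block-structured workflow and collect traces.
        traces = [workflow.run(task) for task in tasks]
        failed = [t for t in traces if not t.success]
        if not failed:
            break

        # Judge: inspect failed traces and assign rank-based responsibility
        # scores (block name -> score) to the workflow's logic blocks.
        scores = judge_llm.rank_blocks(workflow.blocks, failed)

        # Optimization: focus the LLM-based optimizer on the block the
        # Judge holds most responsible for the failures.
        worst_block = max(workflow.blocks, key=lambda b: scores[b.name])
        revised_block = optimizer_llm.revise(worst_block, failed)

        # Update: replace the problematic block and iterate.
        workflow.replace(worst_block, revised_block)
    return workflow
\end{verbatim}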