LLM-based agents have emerged as promising tools that accomplish complex tasks through iterative planning and action. However, these agents are prone to planning hallucinations when they lack the specific knowledge that expertise-intensive tasks require. To address this, preliminary attempts have been made to improve planning reliability by incorporating external workflow-related knowledge. Despite this promise, the infused knowledge is mostly disorganized and diverse in format, lacking rigorous formalization and comprehensive comparison. Motivated by this, we formalize different formats of workflow knowledge and present FlowBench, the first benchmark for workflow-guided planning. FlowBench covers 51 different scenarios from 6 domains, with knowledge presented in diverse formats. To assess different LLMs on FlowBench, we design a multi-tiered evaluation framework. We evaluate the efficacy of workflow knowledge across multiple formats, and the results indicate that current LLM agents need considerable improvement to achieve satisfactory planning. We hope that our challenging benchmark can pave the way for future agent planning research.