Agents based on large language models (LLMs) have demonstrated effectiveness in solving a wide range of tasks by integrating LLMs with key modules such as planning, memory, and tool usage. Increasingly, customers are adopting LLM agents in a variety of reliability-critical commercial applications, including support for mental well-being, chemical synthesis, and software development. Nevertheless, our observations and daily use of LLM agents indicate that they are prone to making erroneous plans, especially when tasks are complex and require long-term planning. In this paper, we propose PDoctor, a novel and automated approach to testing LLM agents and understanding their erroneous planning. As the first work in this direction, we formulate the detection of erroneous planning as a constraint satisfiability problem: an LLM agent's plan is considered erroneous if its execution violates the constraints derived from the user inputs. To this end, PDoctor first defines a domain-specific language (DSL) for user queries and synthesizes varied inputs with the assistance of the Z3 constraint solver. These synthesized inputs are natural language paragraphs that specify the requirements for completing a series of tasks. Then, PDoctor derives constraints from these requirements to form a testing oracle. We evaluate PDoctor with three mainstream agent frameworks and two powerful LLMs (GPT-3.5 and GPT-4). The results show that PDoctor can effectively detect diverse errors in agent planning and provide insights and error characteristics that are valuable to both agent developers and users. We conclude by discussing potential alternative designs and directions to extend PDoctor.