The emergence of Large Language Models (LLMs) like ChatGPT has inspired the development of LLM-based agents capable of addressing complex, real-world tasks. However, these agents often struggle during task execution due to methodological constraints, such as error propagation and limited adaptability. To address these limitations, we propose a multi-agent framework based on dynamic Task Decomposition and Agent Generation (TDAG). This framework dynamically decomposes complex tasks into smaller subtasks and assigns each to a specifically generated subagent, thereby enhancing adaptability in diverse and unpredictable real-world tasks. Separately, existing benchmarks often lack the granularity needed to evaluate incremental progress in complex, multi-step tasks. In response, we introduce ItineraryBench in the context of travel planning, featuring interconnected, progressively complex tasks with a fine-grained evaluation system. ItineraryBench is designed to assess agents' abilities in memory, planning, and tool usage across tasks of varying complexity. Our experimental results reveal that TDAG significantly outperforms established baselines, showcasing its superior adaptability and context awareness in complex task scenarios.
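The abstract describes the decompose-then-generate loop only at a high level, so the following Python sketch is an illustration under stated assumptions rather than the authors' implementation. The names `SubtaskResult`, `run_tdag_style`, `toy_decompose`, and `toy_generate_agent` are hypothetical, and plain functions stand in for what would be LLM calls in a real system. The sketch shows one way a plan could be revised after a failed subtask, with a fresh subagent created for each subtask.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubtaskResult:
    subtask: str
    output: str
    succeeded: bool


# Hypothetical type aliases: in a real system both would wrap LLM calls.
Agent = Callable[[str, List[SubtaskResult]], SubtaskResult]
Decomposer = Callable[[str, List[SubtaskResult]], List[str]]
AgentFactory = Callable[[str], Agent]


def run_tdag_style(task: str, decompose: Decomposer,
                   generate_agent: AgentFactory,
                   max_steps: int = 20) -> List[SubtaskResult]:
    """Run subtasks one at a time, re-planning whenever one fails."""
    history: List[SubtaskResult] = []
    pending = decompose(task, history)      # initial decomposition
    steps = 0
    while pending and steps < max_steps:
        subtask = pending.pop(0)
        agent = generate_agent(subtask)     # subagent made for this subtask
        result = agent(subtask, history)
        history.append(result)
        steps += 1
        if not result.succeeded:
            # Dynamic step (assumed): rebuild the remaining plan from history.
            pending = decompose(task, history)
    return history


def toy_decompose(task: str, history: List[SubtaskResult]) -> List[str]:
    """Toy planner: swaps a failed train booking for a bus booking."""
    done = {r.subtask for r in history if r.succeeded}
    failed = {r.subtask for r in history if not r.succeeded}
    plan = ["pick cities", "book train", "book hotel"]
    plan = ["book bus" if step in failed else step for step in plan]
    return [step for step in plan if step not in done]


def toy_generate_agent(subtask: str) -> Agent:
    """Toy factory: the returned agent fails only on 'book train'."""
    def agent(st: str, history: List[SubtaskResult]) -> SubtaskResult:
        ok = st != "book train"             # pretend trains are sold out
        return SubtaskResult(st, f"[{subtask} agent] handled '{st}'", ok)
    return agent


results = run_tdag_style("plan a trip", toy_decompose, toy_generate_agent)
for r in results:
    print(r.subtask, "ok" if r.succeeded else "failed")
```

In this toy run the failed "book train" step triggers a re-plan that substitutes "book bus", so the failure does not propagate to the hotel booking. The `max_steps` cap is an added safeguard against endless re-planning, not something the abstract specifies.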