Large Language Models (LLMs) show great potential when augmented with external tools, but they face significant challenges in complex, multi-turn tool invocation: they often exhibit weak planning, tool hallucination, and erroneous parameter generation, and they struggle to interact robustly with tool environments. To tackle these issues, we present PEARL, a novel framework that enhances LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase in which the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase in which a dedicated Planner is trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show that PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of \textbf{56.5\%} on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents.
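As context for the online phase, the sketch below illustrates the two ingredients that characterize GRPO: group-relative advantage normalization over a batch of sampled plans and a clipped policy-gradient objective. This is a minimal sketch under our own assumptions, not PEARL's implementation; in particular, `rewards` is a placeholder for the planning-quality reward, whose exact terms the abstract does not specify, and the function names are hypothetical.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_prompts, group_size), one group of sampled plans per prompt.
    # Group-relative advantage: normalize each plan's reward against the
    # mean/std of its own group, so no learned value function is needed.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    # Clipped PPO-style surrogate with the group-relative advantage;
    # logp_new/logp_old are per-sample log-probs under the current and
    # sampling policies, shaped like advantages.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```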