Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. However, when faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designed task-specific demonstrations or on retrieval from a curated library. These approaches demand substantial expert effort, and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty grow. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on a static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy and 4.7% on hard questions. We further test SEER on $\tau$-bench, which covers two real-world domains. Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44% and 23.38%, respectively.
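The core loop described above — augmenting a pool only with successful trajectories and retrieving the most similar past steps at each new step — can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the `ExperiencePool` class, its bag-of-words similarity, and the example trajectories are all assumptions standing in for whatever embedding and retrieval machinery SEER actually uses.

```python
from collections import Counter
from math import sqrt

def _vec(text):
    # Crude bag-of-words vector; a real system would use learned embeddings.
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ExperiencePool:
    """Continually updated pool of (step description, action) pairs."""

    def __init__(self):
        self.steps = []

    def add_trajectory(self, trajectory, succeeded):
        # Only successful trajectories augment the pool.
        if succeeded:
            self.steps.extend(trajectory)

    def recall(self, current_step, k=2):
        # Stepwise retrieval: rank stored steps by similarity to the
        # current step and return the top-k as in-context guidance.
        q = _vec(current_step)
        ranked = sorted(self.steps,
                        key=lambda s: _cosine(q, _vec(s[0])),
                        reverse=True)
        return ranked[:k]

pool = ExperiencePool()
pool.add_trajectory([("look up flight price", "call flight_search"),
                     ("compute total cost", "call calculator")],
                    succeeded=True)
pool.add_trajectory([("guess the answer", "no tool")], succeeded=False)

# The failed trajectory is discarded; retrieval surfaces the relevant step.
print(pool.recall("find the price of a flight", k=1))
```

The key design point mirrored here is that failed trajectories never enter the pool, so the retrieved demonstrations stay reliable as the pool grows.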