Existing Tool-Integrated Reasoning (TIR) models extend the question-answering capabilities of LLMs by incorporating external tools. However, real-world scenarios present many open-ended problems for which a fixed tool set fails to meet task requirements. Without a self-optimization mechanism, erroneous tool outputs can also mislead the LLM's responses, and building tools by hand takes significant manual effort, which limits their applicability. Recognizing that the reasoning traces of LLMs encapsulate implicit problem-solving capabilities, we propose UCT, a training-free framework that transforms agents from tool users into tool creators. UCT harvests reasoning experiences and distills them into reusable assets, enabling adaptive tool creation and self-updating during inference. We also introduce a memory consolidation mechanism that maintains the tool library and keeps the retained experiential memory highly reusable for subsequent reasoning tasks. Because tool quality improves continuously during reasoning, the overall agent system progresses without additional training. Extensive experiments show that our method offers a new paradigm for enhancing TIR models. In particular, performance gains of +20.86% and +23.04% on benchmarks spanning multi-domain mathematical and scientific reasoning tasks validate the agent's self-evolving capability.
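The loop described above (create a tool from a reasoning experience, update it when it proves faulty, and consolidate the library so that only reusable tools remain) can be illustrated with a minimal Python sketch. This is not the UCT implementation: the names `Tool`, `ToolLibrary`, `create_or_update`, `use`, and `consolidate`, the success-rate statistic, and the thresholds `min_calls` and `min_success_rate` are all hypothetical choices made for illustration, and tools here are plain Python callables rather than artifacts distilled from LLM reasoning traces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Tool:
    """A reusable tool plus simple usage statistics (hypothetical structure)."""
    name: str
    func: Callable[..., object]
    calls: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.calls if self.calls else 0.0


@dataclass
class ToolLibrary:
    """Illustrative tool library; not the paper's actual data structure."""
    tools: Dict[str, Tool] = field(default_factory=dict)

    def create_or_update(self, name: str, func: Callable[..., object]) -> None:
        # Registering under an existing name replaces the tool and resets
        # its statistics, standing in for "self-updating" a faulty tool.
        self.tools[name] = Tool(name, func)

    def use(self, name: str, *args,
            check: Optional[Callable[[object], bool]] = None):
        # Run the tool and record whether the result passed the optional check.
        tool = self.tools[name]
        tool.calls += 1
        try:
            result = tool.func(*args)
        except Exception:
            return None  # a crashing tool counts as a failed call
        if check is None or check(result):
            tool.successes += 1
        return result

    def consolidate(self, min_calls: int = 2,
                    min_success_rate: float = 0.5) -> None:
        # Keep tools that are too new to judge or that have proven reliable;
        # both thresholds are assumptions chosen for this sketch.
        self.tools = {
            n: t for n, t in self.tools.items()
            if t.calls < min_calls or t.success_rate >= min_success_rate
        }


if __name__ == "__main__":
    lib = ToolLibrary()
    lib.create_or_update("triangle_area", lambda b, h: b * h)        # buggy draft
    lib.use("triangle_area", 4, 3, check=lambda r: r == 6)           # fails check
    lib.create_or_update("triangle_area", lambda b, h: 0.5 * b * h)  # updated tool
    lib.use("triangle_area", 4, 3, check=lambda r: r == 6)           # passes check
    lib.consolidate()
    print(sorted(lib.tools))
```

In this sketch the consolidation rule is a simple success-rate filter. The abstract does not specify how the framework judges reusability, so any real criterion would replace that filter.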