Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To manage long-horizon interactions, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. The code and demo are available at https://github.com/RUC-NLPIR/DeepAgent.