VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool Use (ARLT) approaches have emerged to address multi-turn tool interactions, existing work relies on task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool makes four key contributions: (1) upstream alignment with VeRL, ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities, including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution that achieves a nearly 2$\times$ speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing a unified training infrastructure. The modular plugin architecture enables rapid tool integration that requires only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
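
To make the plugin and asynchronous-rollout ideas concrete, the following is a minimal sketch, not VerlTool's actual API. All names here (`TOOLS`, `register_tool`, `rollout`, `rollout_batch`, `run_demo`) are hypothetical, the model's generation step is replaced by a string stub, and tool latency is simulated with `asyncio.sleep`. It illustrates how a tool could be added as a small Python definition, and how each trajectory could await its own tool call instead of waiting for the whole batch to finish a turn.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical registry mapping tool names to async callables.
# This is an illustrative assumption, not the framework's real interface.
ToolFn = Callable[[str], Awaitable[str]]
TOOLS: Dict[str, ToolFn] = {}


def register_tool(name: str) -> Callable[[ToolFn], ToolFn]:
    """Register an async tool function under a name (plugin-style)."""
    def decorator(fn: ToolFn) -> ToolFn:
        TOOLS[name] = fn
        return fn
    return decorator


@register_tool("python")
async def run_python(action: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for sandboxed code execution latency
    return f"[python output for: {action}]"


@register_tool("search")
async def run_search(action: str) -> str:
    await asyncio.sleep(0.03)  # stand-in for a slower retrieval call
    return f"[search results for: {action}]"


async def rollout(prompt: str, tool_name: str, max_turns: int = 2) -> List[str]:
    """Build one multi-turn trajectory: prompt, then (action, observation) pairs."""
    trajectory = [prompt]
    for turn in range(max_turns):
        action = f"{prompt} / turn {turn}"          # stub for LLM generation
        observation = await TOOLS[tool_name](action)  # only this trajectory waits
        trajectory += [action, observation]
    return trajectory


async def rollout_batch(requests: List[Tuple[str, str]]) -> List[List[str]]:
    """Run all trajectories concurrently; a slow tool does not block the others."""
    return await asyncio.gather(*(rollout(p, t) for p, t in requests))


def run_demo() -> List[List[str]]:
    requests = [("solve 2+2", "python"), ("capital of France", "search")]
    return asyncio.run(rollout_batch(requests))
```

In this sketch the fast `python` trajectory can start its second turn while the `search` trajectory is still waiting on its first observation, which is the kind of synchronization stall the abstract says asynchronous rollout removes. The sketch makes no claim about the size of the speedup.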
