Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR depends heavily on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human effort and hinders RL scaling, especially in agentic scenarios. Although a few recent works explore task-synthesis methods, the difficulty of the generated agentic tasks is hard to control, so they yield limited advantage for RL training. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM performs multi-turn search-engine calls and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output correct answer predictions. To ensure that each generated search query has an accurate ground truth, we collect all search results from the proposer's trajectory as external knowledge, then conduct retrieval-augmented generation (RAG) to test whether the proposed query can be answered correctly when all necessary search documents are provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. Extensive experimental results show that SSP significantly and uniformly improves search agents' performance on various benchmarks without any supervision, under both from-scratch and continuous RL training setups. The code is at https://github.com/Alibaba-Quark/SSP.
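The RAG-based verification step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`verify_query`, `call_llm`, `answers_match`) and the prompt template are hypothetical stand-ins, and the real system would plug in the actual LLM solver and answer-matching judge.

```python
def verify_query(query, ground_truth, trajectory_docs, call_llm, answers_match):
    """Check that a proposed query is answerable via RAG over the documents
    collected along the proposer's own search trajectory.

    call_llm: callable(prompt) -> str, a stand-in for the LLM answerer.
    answers_match: callable(prediction, ground_truth) -> bool, a stand-in
    for the answer-equivalence judge.
    """
    # Concatenate all documents gathered during the proposer's trajectory
    # so the answerer has every piece of evidence it should need.
    context = "\n\n".join(trajectory_docs)
    prompt = (
        "Answer the question using only the documents below.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    prediction = call_llm(prompt)
    # Only queries whose RAG prediction matches the proposer's claimed
    # ground truth are accepted as valid training tasks.
    return answers_match(prediction, ground_truth)


# Toy usage with stub components in place of the real LLM and judge.
docs = ["Doc A: The Eiffel Tower is located in Paris.",
        "Doc B: Paris is the capital of France."]
stub_llm = lambda prompt: "Paris"
stub_judge = lambda pred, gt: pred.strip().lower() == gt.strip().lower()
print(verify_query("Where is the Eiffel Tower?", "Paris", docs, stub_llm, stub_judge))
```

In this sketch a query is kept only if the solver, given the full evidence set, reproduces the proposer's claimed answer; queries that fail this check are discarded rather than used as (possibly mislabeled) RL training signal.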