Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.