Reinforcement learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches rely predominantly on gold supervision, such as ground-truth answers, which is difficult to obtain at scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike an insufficient or irrelevant one, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, and this reconstruction induces a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, because reconstruction may rely on superficial lexical cues rather than on the underlying search process. To mitigate this leakage, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on the retrieved observations together with the structural scaffold, so that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable paradigm for training search agents in settings where gold supervision is unavailable.
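The reward described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Step`, `Trajectory`, `mask_entities`, `bottlenecked_view`, `token_f1`, and `cycle_reward` are hypothetical, the capitalized-word regex is only a crude stand-in for a real NER model, and token-level F1 is an assumed similarity measure, since the abstract does not state how reconstruction quality is scored. The reconstruction model is passed in as a plain callable.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

MASK = "[ENT]"  # placeholder token for masked entities (assumption)


@dataclass
class Step:
    query: str        # search query issued by the agent
    observation: str  # text retrieved for that query


@dataclass
class Trajectory:
    steps: List[Step]
    final_response: str  # deliberately excluded from the reconstruction input


def mask_entities(text: str) -> str:
    """Crude stand-in for NER masking: replace each run of capitalized words."""
    return re.sub(r"\b[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*", MASK, text)


def bottlenecked_view(traj: Trajectory) -> str:
    """Build the reconstruction input: masked queries plus raw observations.

    The final response is never included, which is the first bottleneck.
    """
    lines = []
    for i, step in enumerate(traj.steps, start=1):
        lines.append(f"query {i}: {mask_entities(step.query)}")
        lines.append(f"observation {i}: {step.observation}")
    return "\n".join(lines)


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (assumed similarity measure)."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cycle_reward(
    question: str,
    traj: Trajectory,
    reconstruct: Callable[[str], str],
) -> float:
    """Score a trajectory by how well the question is rebuilt from it."""
    reconstructed = reconstruct(bottlenecked_view(traj))
    return token_f1(reconstructed, question)
```

In this sketch, `reconstruct` would be a language model prompted to recover the question from the bottlenecked trajectory, and the returned score would be passed to the policy optimizer as the reward for that trajectory. No gold answer appears anywhere in the computation.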