Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of an agent's reasoning process and often lead to undesirable behaviors such as shortcut exploitation and hallucination. To address these limitations, we propose \textbf{Citation-aware Rubric Rewards (CaRR)}, a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce \textbf{Citation-aware Group Relative Policy Optimization (C-GRPO)}, which combines CaRR with outcome rewards to train robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis further validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and generalizes well to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.
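To make the reward design concrete, the snippet below is a minimal, hypothetical sketch of how a CaRR-style rubric reward could be blended with a binary outcome reward and turned into group-relative advantages in a GRPO-style update. It is not the paper's actual implementation; all names (`Rubric`, `rubric_reward`, `combined_reward`, `alpha`) and the specific scoring rules are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    """A verifiable single-hop rubric: a hidden entity that must be
    surfaced in the reasoning trace and backed by a correct citation.
    (Hypothetical structure, for illustration only.)"""
    entity: str              # hidden entity the agent is expected to identify
    valid_sources: set[str]  # document ids that genuinely support this entity

def rubric_reward(trace: str, citations: dict[str, list[str]],
                  rubrics: list[Rubric]) -> float:
    """Fraction of rubrics that are both mentioned in the trace and
    supported by at least one correct citation (assumed scoring rule)."""
    satisfied = 0
    for r in rubrics:
        mentioned = r.entity.lower() in trace.lower()
        cited = bool(set(citations.get(r.entity, [])) & r.valid_sources)
        satisfied += int(mentioned and cited)
    return satisfied / max(len(rubrics), 1)

def combined_reward(rubric_score: float, outcome: float,
                    alpha: float = 0.5) -> float:
    """Blend the fine-grained rubric reward with the binary outcome reward.
    The weighting scheme here is an assumption, not the paper's choice."""
    return alpha * rubric_score + (1.0 - alpha) * outcome

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: standardize rewards within a rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = max(std, 1e-8)  # avoid division by zero for identical rewards
    return [(r - mean) / std for r in rewards]
```

In this sketch, a rollout that guesses the right answer without citing evidence for the intermediate entities receives a lower combined reward than one that completes the evidence chain, which is the intuition behind discouraging shortcut exploitation.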