Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains such as math and code, recent works have begun exploring how to train LLMs to use search engines more effectively as tools for retrieval-augmented generation. Although these methods achieve performance improvements across QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive framework for evaluating RL-based search agents, covering three distinct faithfulness metrics: information-think faithfulness, think-answer faithfulness, and think-search faithfulness. Our evaluations reveal that canonical search agents trained via Reinforcement Learning from Verifiable Reward (RLVR) -- including SearchR1 and ReSearch -- have significant room for improvement in this regard. To foster faithful reasoning, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness but also achieve better task performance than baselines trained with a purely outcome-based reward.