We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static `rewrite, retrieve, and generate' pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. ChatR1 demonstrates strong performance with both 3B and 7B model backbones, outperforming competitive baselines on five CQA datasets across multiple metrics (F1, BERTScore, and LLM-as-judge). The datasets cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, probing ChatR1's behavior from multiple angles. Ablation studies confirm the effectiveness of the intent-aware reward, and our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.
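The abstract describes the intent-aware reward only at a high level. As a minimal sketch of the underlying idea, and not the paper's actual implementation, the snippet below shows how a dense, turn-level intent-alignment term could be blended with a sparse outcome reward to mitigate the sparse and delayed reward signal. The functional form, the `similarity` callable, and the weight `alpha` are all illustrative assumptions introduced here for exposition.

```python
# Hypothetical sketch of an intent-aware, turn-level reward in the spirit of
# the abstract's description. All names (similarity, alpha) are assumptions,
# not ChatR1's actual reward implementation.
from typing import Callable, List


def intent_aware_reward(
    final_answer_score: float,          # sparse outcome reward, e.g. answer F1
    turn_queries: List[str],            # search queries issued while reasoning
    turn_intents: List[str],            # inferred user intent at each turn
    similarity: Callable[[str, str], float],  # (query, intent) -> [0, 1]
    alpha: float = 0.5,                 # weight on the turn-level shaping term
) -> float:
    """Combine a sparse outcome reward with dense turn-level feedback."""
    if not turn_queries:
        return final_answer_score
    # Turn-level shaping: how well each issued query tracks the evolving intent.
    alignment = sum(
        similarity(q, i) for q, i in zip(turn_queries, turn_intents)
    ) / len(turn_queries)
    # Dense shaping supplements the sparse, delayed outcome-only signal.
    return (1 - alpha) * final_answer_score + alpha * alignment
```

Under this reading, the shaping term rewards retrieval behavior that stays aligned with the user's evolving goal at every turn, rather than crediting only the final answer.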