Large Language Models (LLMs) demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance when information is revealed progressively over multiple turns. Motivated by recent progress in Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers but also to judge whether a question is solvable at the current point in a multi-turn conversation. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty, measured in the number of instruction shards, which stabilizes training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing the premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (from 62.6% to 75.1%) and improves calibrated abstention rates (from 33.5% to 73.4%). Together, these results provide a practical recipe for building LLMs that are reliable and trustworthy in multi-turn settings.
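To make the two mechanisms named in the abstract more concrete, the following is a minimal Python sketch of a mixed accuracy/abstention reward and a competence-gated curriculum over instruction shards. It is an illustration under stated assumptions, not the authors' implementation: the `ABSTAIN` sentinel, the reward values, the exact-match check, the rolling-window gate, and the names `mixed_reward` and `CompetenceGatedCurriculum` are all hypothetical, since the abstract does not specify them.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical sentinel a model emits when it judges the question unsolvable so far.
ABSTAIN = "ABSTAIN"


def mixed_reward(response: str, gold_answer: str, solvable: bool,
                 r_correct: float = 1.0, r_abstain: float = 1.0,
                 r_wrong: float = 0.0) -> float:
    """Toy mixed reward (reward values are assumptions).

    If enough information has been revealed (solvable), reward a correct answer.
    Otherwise, reward abstention and give r_wrong for a premature answer.
    """
    if solvable:
        return r_correct if response == gold_answer else r_wrong
    return r_abstain if response == ABSTAIN else r_wrong


@dataclass
class CompetenceGatedCurriculum:
    """Raise the number of instruction shards only once the policy is competent.

    The gate (mean of the last `window` rewards >= `threshold`) is an assumed
    criterion chosen for illustration.
    """
    max_shards: int
    threshold: float = 0.7
    window: int = 4
    level: int = 1  # current number of instruction shards per dialogue
    recent: deque = field(default_factory=deque)

    def update(self, reward: float) -> int:
        # Keep a rolling window of the most recent rollout rewards.
        self.recent.append(reward)
        if len(self.recent) > self.window:
            self.recent.popleft()
        window_full = len(self.recent) == self.window
        competent = window_full and sum(self.recent) / self.window >= self.threshold
        if competent and self.level < self.max_shards:
            self.level += 1       # move to harder dialogues
            self.recent.clear()   # re-measure competence at the new level
        return self.level


if __name__ == "__main__":
    curriculum = CompetenceGatedCurriculum(max_shards=3, threshold=0.75, window=2)
    rollouts = [("42", True), (ABSTAIN, False), ("41", False), (ABSTAIN, False), ("42", True)]
    for response, solvable in rollouts:
        reward = mixed_reward(response, "42", solvable)
        print(reward, curriculum.update(reward))
```

In this sketch a premature answer to a not-yet-solvable question earns the same reward as a wrong answer, so it also delays the move to dialogues with more shards. A real training setup would likely use a verifier rather than exact string matching and could penalize premature answers more strongly.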