Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards

Large Language Models (LLMs) demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance when information is revealed progressively over multiple turns. Motivated by recent progress in Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers but also to judge whether a question is solvable at the current point in a multi-turn conversation. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (measured in instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing the premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building LLMs that are reliable and trustworthy in multi-turn settings.
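To make the two ideas in the abstract concrete, the following is a minimal Python sketch of a mixed accuracy/abstention reward and a competence-gated curriculum step. It is an illustration only, not the authors' implementation: the abstract does not specify reward values, the gating rule, or any thresholds, so `mixed_reward`, `advance_curriculum`, `CurriculumState`, the 0/1 reward values, the 0.7 gate, and the shard limit of 4 are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CurriculumState:
    """Tracks dialogue difficulty, counted in instruction shards (all values hypothetical)."""
    num_shards: int = 1          # current number of shards per dialogue
    max_shards: int = 4          # assumed upper bound on difficulty
    gate_threshold: float = 0.7  # assumed competence level required to advance


def mixed_reward(answer: Optional[str], gold: str, solvable: bool) -> float:
    """Verifiable reward for one turn; `answer is None` means the model abstained.

    Assumption: once all needed information has been revealed (`solvable`),
    only a correct answer is rewarded; before that, only abstention is rewarded,
    which penalizes premature answering.
    """
    abstained = answer is None
    if solvable:
        return 1.0 if (not abstained and answer == gold) else 0.0
    return 1.0 if abstained else 0.0


def advance_curriculum(state: CurriculumState, recent_rewards: Sequence[float]) -> CurriculumState:
    """Increase difficulty by one shard only when mean recent reward clears the gate."""
    if not recent_rewards:
        return state  # no evidence of competence yet, so stay at the current level
    mean_reward = sum(recent_rewards) / len(recent_rewards)
    if mean_reward >= state.gate_threshold and state.num_shards < state.max_shards:
        state.num_shards += 1
    return state


if __name__ == "__main__":
    state = CurriculumState()
    rewards = [
        mixed_reward(None, "42", solvable=False),  # abstains while information is missing
        mixed_reward("42", "42", solvable=True),   # answers correctly once solvable
        mixed_reward("41", "42", solvable=True),   # answers incorrectly
    ]
    advance_curriculum(state, rewards)
    print(rewards, state.num_shards)
```

In this sketch the gate keeps the model at its current shard count until it is both answering correctly and abstaining appropriately, which is one plausible reading of "competence-gated"; the actual rule used by RLAAR may differ.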

