Reinforcement learning (RL) for large language models (LLMs) has shown strong performance on single-turn tasks, but extending it to multi-turn interaction remains challenging because rewards are sparse and per-turn credit assignment is poor. In emotional support dialogues, each response shapes the user's future states, so step-wise comparison of responses from a matched state is unavailable, while trajectory-level supervision alone is insufficient. We propose MICA (Multi-granularity Intertemporal Credit Assignment), a critic-free RL framework for multi-turn emotional support tasks. MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state. The Incremental Distance Reward measures the per-turn decrease in residual distance to the target state, while its Monte Carlo return captures delayed effects. After scope-specific normalization, the two signals are combined into a mixed advantage that supports stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic. On EMPA, EQ-Bench, and EmoBench with Qwen2.5-7B-Instruct and Qwen3-8B/14B/32B, MICA consistently outperforms GRPO and REINFORCE++, with gains of up to +43.2 on EMPA, while adding no rollout cost and remaining robust to the choice of reward judge. These results show that turn-aware credit assignment enables effective and practical multi-turn RL for interactive LLMs.
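The credit signals described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the distance, the discount, the normalization scopes, or the mixing rule, so the L1 distance, the discount `gamma`, z-scoring over all turns of a rollout group, the mixing weight `alpha`, and all function names below are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def residual_distance(state, target):
    # Assumed L1 distance between the user's support state and the target state.
    return sum(abs(t - s) for s, t in zip(state, target))


def incremental_rewards(states, target):
    # Per-turn decrease in residual distance: r_t = d(s_t) - d(s_{t+1}).
    d = [residual_distance(s, target) for s in states]
    return [d[i] - d[i + 1] for i in range(len(d) - 1)]


def mc_returns(rewards, gamma=1.0):
    # Monte Carlo return of the incremental rewards: G_t = r_t + gamma * G_{t+1}.
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def zscore(values, eps=1e-8):
    # Standardize a flat list; eps guards against zero variance.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def mixed_advantages(group_states, target, alpha=0.5, gamma=1.0):
    # group_states: one list of states per rollout, each of length (turns + 1).
    rewards = [incremental_rewards(s, target) for s in group_states]
    returns = [mc_returns(r, gamma) for r in rewards]
    # Assumption: each signal is normalized separately over every turn in the group.
    flat_r = zscore([x for traj in rewards for x in traj])
    flat_g = zscore([x for traj in returns for x in traj])
    mixed = [alpha * a + (1 - alpha) * b for a, b in zip(flat_r, flat_g)]
    # Restore the per-rollout, per-turn layout.
    out, i = [], 0
    for traj in rewards:
        out.append(mixed[i:i + len(traj)])
        i += len(traj)
    return out


if __name__ == "__main__":
    target = [2, 2]
    group = [
        [[0, 0], [1, 0], [1, 2]],  # rollout that moves toward the target each turn
        [[0, 0], [0, 0], [1, 0]],  # rollout that stalls on its first turn
    ]
    for adv in mixed_advantages(group, target):
        print([round(a, 3) for a in adv])
```

Because both signals come from the same distance function, the undiscounted return telescopes to the total distance reduction from that turn onward, which is one way to read "shared potential function" in the abstract. Under these assumptions no extra rollouts or learned value model are needed; only the per-turn state estimates are required.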