Reinforcement Learning with Verifiable Rewards (RLVR) can elicit strong multi-step reasoning, yet it often encourages overly verbose traces. Moreover, naive length penalties in group-relative optimization can severely hurt accuracy. We attribute this failure to two structural issues: (i) Dilution of Length Baseline, where incorrect responses (with zero length reward) depress the group baseline and over-penalize correct solutions; and (ii) Difficulty-Penalty Mismatch, where a static penalty cannot adapt to problem difficulty, suppressing necessary reasoning on hard instances while leaving redundancy on easy ones. We propose Dynamic Decoupled Conditional Advantage (DDCA) to decouple efficiency optimization from correctness. DDCA computes length advantages conditionally within the correct-response cluster to eliminate baseline dilution, and dynamically scales the penalty strength using the group pass rate as a proxy for difficulty. Experiments on GSM8K, MATH500, AMC23, and AIME25 show that DDCA consistently improves the efficiency--accuracy trade-off relative to adaptive baselines, reducing generated tokens by approximately 60% on simpler tasks (e.g., GSM8K) and by over 20% on harder benchmarks (e.g., AIME25), while maintaining or improving accuracy. Code is available at https://github.com/alphadl/DDCA.
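The two mechanisms described above (a length baseline computed only over correct responses, and a penalty strength scaled by the group pass rate) can be sketched as follows. This is a minimal illustration under assumed names and hyperparameters (e.g., the penalty weight `alpha`), not the paper's exact formulation:

```python
import numpy as np

def ddca_advantage(correct, lengths, alpha=0.5):
    """Combined advantage for one group of sampled responses to a prompt.

    correct: per-response correctness flags; lengths: per-response token counts.
    alpha: assumed maximum length-penalty strength (hypothetical hyperparameter).
    """
    correct = np.asarray(correct, dtype=bool)
    lengths = np.asarray(lengths, dtype=float)

    # Correctness advantage: standard group-relative (GRPO-style) normalization.
    r = correct.astype(float)
    acc_adv = (r - r.mean()) / (r.std() + 1e-8)

    # Conditional length advantage: the baseline is computed ONLY within the
    # correct-response cluster, so zero-reward wrong answers cannot dilute it.
    len_adv = np.zeros_like(lengths)
    if correct.sum() >= 2:
        mu, sd = lengths[correct].mean(), lengths[correct].std()
        # shorter-than-baseline correct answers receive a positive advantage
        len_adv[correct] = -(lengths[correct] - mu) / (sd + 1e-8)

    # Dynamic scaling: the group pass rate proxies difficulty, so easy prompts
    # (high pass rate) are penalized for length more strongly than hard ones.
    pass_rate = r.mean()
    return acc_adv + alpha * pass_rate * len_adv
```

In this sketch, a prompt the group never solves has `pass_rate = 0`, so the length penalty vanishes entirely on the hardest instances, matching the stated goal of not suppressing necessary reasoning there.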