Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens: the filler phrase "Let me think" receives the same gradient update as the critical calculation "23 + 45 = 68." We propose counterfactual importance weighting: mask reasoning spans, measure the resulting drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation; instead, importance is estimated directly from the policy model's own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming that we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.
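The core mechanism can be illustrated with a minimal sketch. The function below turns per-span drops in answer log-probability into token-group weights; the clipping and softmax-style normalization are illustrative assumptions, not necessarily the paper's exact formulation, and the log-probabilities would come from the policy model itself.

```python
import math
from typing import List

def importance_weights(base_logprob: float,
                       masked_logprobs: List[float],
                       temperature: float = 1.0) -> List[float]:
    """Convert counterfactual log-prob drops into importance weights.

    base_logprob: log P(answer | prompt, full reasoning) under the policy.
    masked_logprobs[i]: log P(answer | prompt, reasoning with span i masked).
    A larger drop when a span is masked indicates the span mattered more.

    Illustrative choices (assumptions, not the paper's spec):
      - negative drops are clipped to zero,
      - a softmax over drops is rescaled so weights sum to the span count,
        keeping the average gradient scale comparable to uniform credit.
    """
    drops = [max(base_logprob - m, 0.0) for m in masked_logprobs]
    exps = [math.exp(d / temperature) for d in drops]
    z = sum(exps)
    n = len(drops)
    return [n * e / z for e in exps]

# Example: masking the calculation span costs far more answer probability
# than masking a filler span, so it receives a larger weight.
w = importance_weights(base_logprob=-1.0, masked_logprobs=[-5.0, -1.1])
```

In a policy gradient update, each token's advantage would then be scaled by the weight of the span containing it, leaving the total gradient magnitude roughly unchanged relative to uniform weighting.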