Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge this gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across a range of reasoning datasets. These results demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.
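To make the core idea concrete, the following is a minimal sketch of MI-based token-wise advantage shaping. It assumes the per-token conditional MI is approximated by the gain in log p(answer | prefix) contributed by each token, and it reshapes a single sequence-level advantage into per-token advantages proportional to a normalized weight on that gain. All function names, arguments, and the softmax weighting scheme here are illustrative assumptions, not IAPO's actual estimator or API.

```python
import numpy as np

def token_advantages(seq_advantage, logp_answer, temperature=1.0):
    """Shape a sequence-level advantage into token-wise advantages.

    Illustrative sketch (not IAPO's actual implementation):
    - logp_answer[t] = log p(final answer | reasoning prefix up to token t),
      e.g. scored by the policy model itself.
    - The per-token gain in this quantity serves as a proxy for the token's
      conditional mutual information with the final answer.
    """
    lp = np.asarray(logp_answer, dtype=float)
    # Pointwise information gain of each token about the final answer.
    gains = np.diff(lp, prepend=lp[0])
    gains[0] = 0.0  # the first token has no preceding prefix to compare against
    # Normalize gains into positive weights; softmax preserves relative order.
    w = np.exp(gains / temperature)
    w = w / w.sum()
    # Tokens with above-average information gain receive amplified advantage;
    # low-utility (verbose, exploratory) tokens are down-weighted.
    return seq_advantage * len(lp) * w
```

The scaling by `len(lp)` keeps the total advantage mass equal to the sequence-level advantage times the sequence length, so the shaping redistributes credit across tokens rather than changing its overall magnitude.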