The advancement of Large Reasoning Models (LRMs) has driven a shift from reactive ``fast thinking'' text generation to systematic, step-by-step ``slow thinking'' reasoning, yielding state-of-the-art performance on complex mathematical and logical tasks. However, the field faces two open problems: \textit{a fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization, which relies on costly external verifiers}. We identify and formally define \textbf{Entropy-Gradient Inversion}, a robust negative correlation between token entropy and logit gradients that serves as a geometric fingerprint of LRM reasoning capability. Building on this, we propose \textbf{Correlation-Regularized Group Policy Optimization (CorR-PO)}, which incorporates this inversion signature into RL reward regularization. Extensive experiments on multiple reasoning benchmarks and model scales show that CorR-PO consistently outperforms state-of-the-art baselines, and indicate that stronger inversion correlates with better reasoning performance.
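To make the notion of a negative correlation between token entropy and logit gradients concrete, the following is a minimal sketch of how such a statistic could be measured for one generated sequence. It is an illustration under stated assumptions, not the formal definition used in this work: it assumes the Pearson coefficient as the correlation measure and the per-token L2 norm as the summary of the logit gradient, and the function names (`token_entropy`, `pearson`, `entropy_gradient_correlation`) are hypothetical. The gradient with respect to the logits is supplied by the caller, since the training objective is not specified here.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape (T, V), one row of vocabulary logits per token.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    p = np.exp(logp)
    return -(p * logp).sum(axis=-1)


def pearson(x, y):
    """Pearson correlation of two 1-D arrays; returns 0.0 if either is constant."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0


def entropy_gradient_correlation(logits, grad_logits):
    """Correlation between token entropy and logit-gradient magnitude.

    Assumption (not from the source): the gradient is summarized by its
    per-token L2 norm, and the correlation is Pearson's r.
    A negative return value corresponds to an inversion in this sketch.
    """
    entropy = token_entropy(logits)
    grad_norm = np.linalg.norm(grad_logits, axis=-1)
    return pearson(entropy, grad_norm)


if __name__ == "__main__":
    # Synthetic example: four tokens whose distributions get progressively sharper,
    # paired with made-up gradients whose norms grow as entropy falls.
    scales = np.array([0.0, 1.0, 2.0, 4.0])
    logits = scales[:, None] * np.array([[1.0, 0.0, 0.0]])
    grad_logits = np.diag([1.0, 2.0, 3.0, 4.0])[:, :3]
    print(entropy_gradient_correlation(logits, grad_logits))
```

In this synthetic case the printed value is negative, because the gradient norms were constructed to increase while entropy decreases; on a real model, `grad_logits` would come from backpropagating whichever objective is being studied.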