The advancement of Large Reasoning Models (LRMs) has shifted text generation from reactive ``fast thinking'' to systematic, step-by-step ``slow thinking'' reasoning, yielding state-of-the-art performance on complex mathematical and logical tasks. However, the field faces two open problems: \textit{a fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization, which relies on costly external verifiers}. We identify and formally define \textbf{Entropy-Gradient Inversion}, a robust negative correlation between token entropy and logit gradients that serves as a geometric fingerprint of LRM reasoning capability. Building on this, we propose \textbf{Correlation-Regularized Group Policy Optimization (CorR-PO)}, which embeds this inversion signature into the regularization of the RL reward. Extensive experiments on multiple reasoning benchmarks and model scales show that CorR-PO consistently outperforms state-of-the-art baselines, and indicate that stronger inversion is associated with better reasoning performance.
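As an illustrative sketch only, the correlation named above could be written as follows. The abstract does not fix the exact form, so the Pearson coefficient, the $\ell_2$ norm of the logit gradient, and the symbols $z_t$, $p_t$, $H_t$, $g_t$, $\rho$, and $\mathcal{L}$ are all assumptions introduced here for illustration, not the paper's definition. For a sequence of $T$ tokens with logits $z_t$ and $p_t = \mathrm{softmax}(z_t)$,
\begin{equation*}
H_t = -\sum_{v} p_t(v)\,\log p_t(v), \qquad
g_t = \bigl\lVert \nabla_{z_t} \mathcal{L} \bigr\rVert_2, \qquad
\rho = \frac{\sum_{t=1}^{T} (H_t - \bar{H})(g_t - \bar{g})}{\sqrt{\sum_{t=1}^{T} (H_t - \bar{H})^2}\,\sqrt{\sum_{t=1}^{T} (g_t - \bar{g})^2}},
\end{equation*}
where $\bar{H}$ and $\bar{g}$ are the means over the $T$ tokens and $\mathcal{L}$ is the training loss. Under these assumptions, inversion would correspond to $\rho < 0$, and a stronger inversion to $\rho$ closer to $-1$.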