Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), RL-trained models often remain vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation propagate irreversibly through the underlying Markov decision process, triggering cascading failures that drive the reasoning trajectory toward collapse. To break this cascade, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). $\text{E}^3\text{RL}$ requires no external signals: it uses the model's own local autoregressive cross-entropy as an intrinsic measure of epistemic uncertainty. With segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ lets the model excise localized logical defects while reusing the existing key-value (KV) cache, giving the reasoning process a self-healing capability. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset. Experiments show that $\text{E}^3\text{RL}$ improves both exploration efficiency and sample efficiency in long-sequence reasoning while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, the 4B and 8B parameter models surpass previous state-of-the-art (SOTA) results by 5.349\% and 6.514\%, respectively. These findings suggest that $\text{E}^3\text{RL}$ breaks the autoregressive curse in long-sequence reasoning and provides a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).
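The erase-and-reuse idea can be illustrated with a minimal sketch. This is not the $\text{E}^3\text{RL}$ implementation: the abstract does not specify the segmentation scheme or the threshold rule, so the fixed segment length, the mean-plus-k-standard-deviations threshold, and the names `segment_means`, `first_erasable_segment`, and `rollback_position` are all assumptions made for illustration. The input is a list of per-token cross-entropy values (negative log-likelihoods of the sampled tokens), and the output is the token position to roll back to, so that the KV cache for everything before that position could be kept.

```python
import math
from typing import List, Optional


def segment_means(token_nll: List[float], seg_len: int) -> List[float]:
    """Mean per-token cross-entropy for each fixed-length segment (assumed scheme)."""
    means = []
    for start in range(0, len(token_nll), seg_len):
        chunk = token_nll[start:start + seg_len]
        means.append(sum(chunk) / len(chunk))
    return means


def first_erasable_segment(
    seg_scores: List[float], k: float = 2.0, warmup: int = 2
) -> Optional[int]:
    """Index of the first segment whose score exceeds an adaptive threshold.

    The threshold for segment i is mean + k * std of the scores of segments
    0..i-1, so it adapts to the trajectory seen so far. This rule is a
    hypothetical stand-in for the paper's dynamic threshold.
    """
    for idx in range(warmup, len(seg_scores)):
        history = seg_scores[:idx]
        mean = sum(history) / len(history)
        var = sum((s - mean) ** 2 for s in history) / len(history)
        threshold = mean + k * math.sqrt(var)
        if seg_scores[idx] > threshold:
            return idx
    return None


def rollback_position(
    token_nll: List[float], seg_len: int = 4, k: float = 2.0, warmup: int = 2
) -> Optional[int]:
    """Token position where the flagged segment starts, or None if none is flagged.

    Tokens before this position are kept, so their cached keys and values
    could be reused when the flagged segment is regenerated.
    """
    idx = first_erasable_segment(segment_means(token_nll, seg_len), k, warmup)
    return None if idx is None else idx * seg_len
```

For example, with `seg_len=4`, a trajectory whose third segment has much higher cross-entropy than the first two is flagged at that segment, and generation would resume from token position 8 with the first eight cached positions intact. How the advantage is allocated across kept and erased segments during training is not covered by this sketch.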