The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent \textit{reward bottleneck}: traditional reinforcement learning (RL) relies on scalar rewards that are \textbf{costly} to scale, \textbf{brittle} across domains, and \textbf{blind} to the underlying logic of a solution. This reliance on external, impoverished signals prevents models from developing a deep, self-contained understanding of reasoning principles. We introduce \textbf{ALIVE} (\emph{Adversarial Learning with Instructive Verbal Evaluation}), a hands-free alignment framework that moves beyond scalar reward optimization toward intrinsic reasoning acquisition. Grounded in the principle of \emph{Cognitive Synergy}, ALIVE unifies problem posing, solving, and judging, a \emph{reasoning trinity}, within a single policy model in order to internalize the logic of correctness. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to learn evaluative criteria directly from raw corpora, effectively transforming external critiques into an endogenous reasoning faculty. Empirical evaluations on mathematical reasoning, code generation, and general logical inference benchmarks show that ALIVE consistently mitigates these reward-signal limitations: with identical data and compute, it achieves accuracy gains, markedly improved cross-domain generalization, and higher self-correction rates. These results indicate that the reasoning trinity fosters a self-sustaining trajectory of capability growth, positioning ALIVE as a scalable foundation for general-purpose reasoning alignment without human-in-the-loop supervision.
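To make the trinity loop above concrete, the sketch below shows one way a single policy could alternate between the poser, solver, and judge roles, with the judge's verbal critique driving a self-correction pass. It is a minimal illustration only: the \texttt{generate}/\texttt{update} interface, the prompt strings, and the zero-sum poser/solver reward scheme are assumptions made for exposition, not ALIVE's actual training procedure.

\begin{verbatim}
# Minimal sketch of the ALIVE trinity loop; all interfaces,
# prompts, and rewards here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Verdict:
    correct: bool
    critique: str   # instructive verbal feedback, not a bare scalar

def judge(policy, problem, solution):
    # The same policy plays the judge and returns a structured verdict.
    text = policy.generate(
        f"Judge this solution.\nProblem: {problem}\n"
        f"Solution: {solution}\n"
        "Reply 'PASS' or 'FAIL: <one-sentence critique>'.")
    ok = text.strip().startswith("PASS")
    return Verdict(ok, "" if ok else text.split(":", 1)[-1].strip())

def alive_step(policy, corpus_snippet):
    # One policy model plays all three roles of the reasoning trinity.
    problem  = policy.generate(f"Pose a hard problem from: {corpus_snippet}")
    solution = policy.generate(f"Solve: {problem}")
    verdict  = judge(policy, problem, solution)
    if not verdict.correct:
        # The verbal critique, not a scalar, drives self-correction.
        solution = policy.generate(
            f"Revise your answer to: {problem}\n"
            f"Critique: {verdict.critique}\nPrevious attempt: {solution}")
        verdict = judge(policy, problem, solution)
    # Adversarial signal: the poser scores when a problem defeats the
    # solver; the solver scores when its answer survives judging.
    policy.update(problem, solution, verdict.critique,
                  poser_reward=float(not verdict.correct),
                  solver_reward=float(verdict.correct))

class StubPolicy:
    # Placeholder so the sketch runs end to end without a real LLM.
    def generate(self, prompt): return "PASS"
    def update(self, *args, **kwargs): pass

alive_step(StubPolicy(), "excerpt from a raw pretraining corpus")
\end{verbatim}

The design point the sketch tries to capture is that the judge's output is text: under the abstract's framing, it is this critique, rather than the scalar game outcome, that the model gradually internalizes as an endogenous reasoning faculty.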