Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite this strong generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we investigate this asymmetry in depth over the course of training and show that, even on the same task, improving generation does not yield corresponding improvements in self-verification. Interestingly, the asymmetry does not hold in reverse: learning to self-verify effectively improves generation performance, matching the accuracy of standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further integrate self-verification into generation training through a multi-task reinforcement learning framework in which generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate gains over generation-only training in both generation and verification capabilities.
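As a minimal sketch of what such a multi-task objective could look like (the symbols $R_{\text{gen}}$, $R_{\text{ver}}$, $\mathcal{D}_{\text{gen}}$, $\mathcal{D}_{\text{ver}}$, and the weight $\lambda$ are our own notation, not necessarily the paper's exact formulation), the policy $\pi_\theta$ would be trained to maximize

$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_{\text{gen}}}\big[R_{\text{gen}}(x, \pi_\theta)\big] + \lambda\, \mathbb{E}_{(x,\hat{y}) \sim \mathcal{D}_{\text{ver}}}\big[R_{\text{ver}}(x, \hat{y}, \pi_\theta)\big],$$

where $R_{\text{gen}}$ rewards producing a correct answer to problem $x$, $R_{\text{ver}}$ rewards correctly judging whether a candidate answer $\hat{y}$ is right, and $\lambda$ balances the two objectives.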