Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for eliciting reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, relying on exhaustive trial-and-error rather than structured planning to reach solutions. Heuristic constraints such as length penalties can reduce verbosity, but they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verifiability. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions from invalid ones, a model can internalize a guidance signal that prunes its search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge candidate solutions to problems with verifiable answers. In the second stage, we fine-tune the same model with vanilla generation RLVR, initialized from the judge checkpoint. Compared with vanilla RLVR on the same math-domain training data, JudgeRLVR achieves a better quality--efficiency trade-off for Qwen3-30B-A3B: on in-domain math benchmarks, it delivers about +3.7 points of average accuracy gain with a 42\% reduction in average generation length; on out-of-domain benchmarks, it delivers about +4.5 points of average accuracy improvement, demonstrating enhanced generalization.
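The two-stage recipe above can be sketched in terms of its reward signals. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the verdict format, and the exact-match answer check are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of the JudgeRLVR reward structure.
# Stage 1 rewards the model for correctly judging candidate solutions
# against verifiable ground-truth labels; stage 2 is standard RLVR on
# final-answer correctness, initialized from the stage-1 judge checkpoint.

def judge_reward(verdict: str, is_correct: bool) -> float:
    """Stage 1 (judge): +1 if the model's correct/incorrect verdict
    matches the verifiable label of the candidate solution, else 0."""
    predicted_correct = verdict.strip().lower() == "correct"
    return 1.0 if predicted_correct == is_correct else 0.0

def answer_reward(answer: str, reference: str) -> float:
    """Stage 2 (generate): vanilla verifiable reward on the final
    answer; shown here as a simple exact match after whitespace strip."""
    return 1.0 if answer.strip() == reference.strip() else 0.0
```

In this framing, stage 1 trains a discriminative signal (can the model tell a valid solution from an invalid one?), and stage 2 reuses that signal implicitly: the generator starts from weights that already encode which reasoning paths are worth pursuing, which is the mechanism the abstract credits for the shorter, more accurate generations.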