Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches rely primarily on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from the other concurrently generated responses. This majority-for-stability + novelty-for-exploration rule embodies the variation-selection principle: selection prevents drift, while novelty prevents collapse. Evaluation results show that EVOL-RL consistently outperforms the majority-only baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base's AIME25 pass@1 from the baseline's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents in-domain diversity collapse but also improves out-of-domain generalization (from math reasoning to broader tasks, e.g., MMLU-Pro and BBEH). The code is available at: https://github.com/YujunZhou/EVOL-RL.
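The majority-for-stability + novelty-for-exploration rule can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the abstract does not specify how reasoning dissimilarity is measured, so token-set Jaccard overlap is used here as a stand-in proxy, and the `novelty_weight` parameter and the additive reward combination are assumptions for illustration.

```python
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """Token-set Jaccard similarity between two reasoning traces."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def evol_rl_rewards(answers, reasonings, novelty_weight=0.5):
    """Hypothetical EVOL-RL-style reward for one batch of sampled solutions.

    answers    : final answer string of each sampled solution
    reasonings : reasoning-trace string of each sampled solution
    Returns the majority answer and a per-solution reward list.
    """
    # Selection: the majority-voted answer acts as the stability anchor.
    majority, _ = Counter(answers).most_common(1)[0]
    token_sets = [set(r.split()) for r in reasonings]
    rewards = []
    for i, ans in enumerate(answers):
        anchor = 1.0 if ans == majority else 0.0
        # Variation: novelty = average dissimilarity to the other
        # concurrently generated reasoning traces (proxy metric).
        sims = [jaccard(token_sets[i], token_sets[j])
                for j in range(len(answers)) if j != i]
        novelty = 1.0 - sum(sims) / len(sims) if sims else 0.0
        rewards.append(anchor + novelty_weight * novelty)
    return majority, rewards
```

With three samples where two agree on the answer but share identical reasoning, the dissenting solution earns no anchor reward but the full novelty bonus, while the duplicated traces receive a reduced novelty term, discouraging collapse onto one reasoning path.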