Self-play bootstraps LLM reasoning through an iterative Challenger-Solver loop: the Challenger is trained to generate questions that target the Solver's capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R-Zero often exhibit non-sustained improvement, where early gains degrade as self-play continues. We identify a key failure mode, Diversity Illusion, where the Solver's training signals appear diverse yet collapse into recurring underlying patterns. It manifests as (1) Local Diversity Illusion, where diversity is enforced only within-batch, inducing cross-iteration mode cycling; and (2) Surface Diversity Illusion, where questions vary superficially but require near-identical reasoning skills. To mitigate them, we propose R-Diverse with two aligned innovations: Memory-Augmented Penalty (MAP), which uses a persistent memory bank to discourage recycling across iterations, and Skill-Aware Measurement (SAM), which evaluates diversity by the reasoning skills exercised rather than surface variation of questions. Across 10 math and general reasoning benchmarks, R-Diverse sustains gains over more iterations and consistently outperforms prior self-play methods. Code is available at https://github.com/Gengsheng-Li/R-Diverse.
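The two mechanisms can be illustrated with a minimal sketch. The class and function names, the cosine-similarity penalty, and the Jaccard measure over skill tags are illustrative assumptions, not the paper's actual implementation: MAP is sketched as a persistent bank of question embeddings that penalizes a new question by its maximum similarity to anything generated in past iterations, and SAM as a diversity score over the set of reasoning skills a question exercises (skill tags assumed to come from an external labeler) rather than its surface form.

```python
import numpy as np


class MemoryAugmentedPenalty:
    """Illustrative sketch of MAP (not the paper's implementation):
    a memory bank that persists across self-play iterations, so a
    Challenger question similar to one from ANY earlier iteration is
    penalized -- unlike within-batch diversity, which permits
    cross-iteration mode cycling."""

    def __init__(self, penalty_weight: float = 0.5):
        self.bank: list[np.ndarray] = []  # persists across iterations
        self.penalty_weight = penalty_weight

    @staticmethod
    def _cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def penalty(self, emb: np.ndarray) -> float:
        # Penalty = max cosine similarity to any question ever stored.
        if not self.bank:
            return 0.0
        return self.penalty_weight * max(self._cosine(emb, m) for m in self.bank)

    def shaped_reward(self, base_reward: float, emb: np.ndarray) -> float:
        # Challenger reward minus the cross-iteration redundancy penalty;
        # the question is then committed to the persistent memory bank.
        r = base_reward - self.penalty(emb)
        self.bank.append(emb)
        return r


def skill_aware_diversity(skills_a: set[str], skills_b: set[str]) -> float:
    """Illustrative sketch of SAM: diversity as 1 - Jaccard similarity
    over the reasoning skills two questions exercise, so superficially
    different questions needing identical skills score near zero."""
    union = skills_a | skills_b
    if not union:
        return 0.0
    return 1.0 - len(skills_a & skills_b) / len(union)
```

Under this sketch, a question recycled in a later iteration earns a reduced reward even if its own batch looks diverse, and two surface-varied questions tagged with the same skills (e.g. both `{"modular arithmetic"}`) get zero skill-aware diversity.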