Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
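As an illustration of the mechanism described in the abstract, the following is a minimal sketch of vocabulary dropout at a single decoding step. It is not the authors' implementation. The abstract states only that the mask is random, hard, non-stationary, and applied to the proposer's output logits. Everything else here is an assumption made for illustration: the function names, the `drop_prob` value, resampling the mask on every call, the `keep_ids` set of protected tokens (for example an end-of-sequence token), and the fallback that restores one token if the mask would remove all of them.

```python
import math
import random


def vocabulary_dropout(logits, drop_prob, rng, keep_ids=()):
    """Return a copy of `logits` with a random subset set to -inf (hard mask).

    Assumptions (not from the paper): `drop_prob` is a per-token Bernoulli
    rate, and `keep_ids` lists token ids that are never masked.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    protected = set(keep_ids)
    masked = []
    for token_id, logit in enumerate(logits):
        dropped = token_id not in protected and rng.random() < drop_prob
        masked.append(-math.inf if dropped else logit)
    # Assumed safeguard: if every token was masked, restore the best one
    # so that the resulting distribution is still well defined.
    if all(value == -math.inf for value in masked):
        best = max(range(len(logits)), key=lambda i: logits[i])
        masked[best] = logits[best]
    return masked


def softmax(logits):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng):
    """Sample one token id from the distribution defined by `logits`."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy usage: a 6-token vocabulary where id 5 plays the role of a protected token.
rng = random.Random(0)
toy_logits = [1.2, 0.3, -0.5, 2.0, 0.0, 0.7]
for step in range(3):
    # A fresh mask is drawn on every call, which makes it non-stationary.
    step_logits = vocabulary_dropout(toy_logits, drop_prob=0.3, rng=rng, keep_ids=(5,))
    print(step, step_logits, sample_token(step_logits, rng))
```

Because the mask sets logits to negative infinity rather than scaling them, masked tokens receive exactly zero probability at that step, which is what makes the mask hard. Drawing a new mask on each call means no fixed token sequence is always available. The granularity of resampling (per token, per sequence, or per batch) is not specified in the abstract and is a free choice in this sketch.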