Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At a fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge is important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage~3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from $75.4\%$ to $78.5\%$ after the bridge and outperforms a matched replay control by $2.8$ points. The operational principle is to avoid using scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
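To make the two reward-density regimes concrete, the following is a minimal sketch and not the authors' implementation. It assumes the commonly published GRPO form, in which each sampled sequence receives one group-normalized scalar advantage, and it assumes one common formulation of on-policy distillation, in which every token sampled by the student is scored by the teacher-minus-student log-probability. The function names and the toy numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Sparse regime: one scalar per sequence, normalized within the group.

    Every token of a sequence shares the same advantage value.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dense_token_rewards(student_logprobs, teacher_logprobs):
    """Dense regime (assumed OPD-style signal): one value per token.

    Both lists hold log-probabilities of the tokens the student sampled.
    """
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


def forward_kl(teacher_probs, student_probs):
    """KL(teacher || student) for a single next-token distribution.

    Assumes student_probs is strictly positive wherever teacher_probs is.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


# Toy group of four rollouts with a binary verifier reward (hypothetical).
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Toy three-token student rollout scored by both models (hypothetical).
token_signal = dense_token_rewards([-0.1, -2.0, -0.5], [-0.2, -0.3, -0.5])

# Toy two-token vocabulary for the warmup-style forward-KL term (hypothetical).
warmup_loss = forward_kl([0.5, 0.5], [0.9, 0.1])
```

In the sparse case a rollout of any length contributes a single number, whereas the dense case yields one number per token. This is the sense in which the two recipes differ in reward density rather than in kind.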
