Aligning large language models (LLMs) to domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning ability. Recent off-policy RL methods improve hard-sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
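To make the rollout-replacement step concrete, below is a minimal Python sketch of how low-reward rollouts could be swapped for policy-rephrased trajectories before the on-policy update. It is an illustrative assumption of the mechanism described above, not the paper's implementation: the `Trajectory` container and the `rephrase` and `reward_fn` callables are hypothetical names introduced here for exposition.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float


def replace_low_reward_rollouts(
    rollouts: List[Trajectory],
    off_policy_reference: str,
    rephrase: Callable[[str, str], str],   # policy prompted to comprehend and restate the reference
    reward_fn: Callable[[str, str], float],
    reward_threshold: float = 0.0,
) -> List[Trajectory]:
    """Sketch of RePO-style rollout replacement (hypothetical helper names).

    Rollouts whose reward falls at or below the threshold are replaced by a
    trajectory the policy itself produces by rephrasing the off-policy
    reference, so the batch stays close to the policy's own distribution.
    """
    updated: List[Trajectory] = []
    for traj in rollouts:
        if traj.reward > reward_threshold:
            # High-reward on-policy rollouts are kept unchanged.
            updated.append(traj)
            continue
        # Ask the policy to restate the off-policy knowledge in its own style.
        rephrased = rephrase(traj.prompt, off_policy_reference)
        score = reward_fn(traj.prompt, rephrased)
        if score > traj.reward:
            updated.append(Trajectory(traj.prompt, rephrased, score))
        else:
            # Fall back to the original rollout if rephrasing did not help.
            updated.append(traj)
    return updated
```

In this reading, only the data fed to the on-policy objective changes; the gradient is still computed on trajectories sampled (or rephrased) by the current policy, which is how the sketch reflects the claim that on-policy training dynamics are preserved.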