Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains. Nevertheless, translation-oriented RL remains challenged by two problems: noisy learning signals arising from Monte Carlo return estimation, and a large trajectory space that favors global exploration over fine-grained local optimization. We introduce \textbf{PEGRL}, a \textit{two-stage} RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs. Return estimation in the post-editing stage thus benefits from conditioning on the current translation behavior, and the two stages jointly support both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of the translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English$\to$Finnish, English$\to$Turkish, and English$\leftrightarrow$Chinese show consistent gains over RL baselines, and for English$\to$Turkish, COMET-KIWI performance is comparable to that of advanced LLM-based systems (DeepSeek-V3.2). Our code and a set of representative pretrained models are publicly available at \url{https://github.com/NJUNLP/peg-rl} and \url{https://huggingface.co/collections/DGME/pegrl}.
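To make the two-stage idea concrete, the following is a minimal sketch of how GRPO-style group-relative advantages might be computed separately for the translation and post-editing stages and then scaled by task-specific weights. It is an illustration under stated assumptions, not the authors' implementation: the function names, the grouping of post-edits by the sampled translation they condition on, and the weight values `w_mt` and `w_pe` are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group (GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def weighted_two_stage_advantages(
    mt_rewards: List[float],
    pe_rewards: List[List[float]],
    w_mt: float = 1.0,  # hypothetical weight for the translation objective
    w_pe: float = 0.5,  # hypothetical weight for the post-editing objective
) -> Tuple[List[float], List[List[float]]]:
    """Compute weighted advantages for both stages of one source sentence.

    mt_rewards[i]: reward of the i-th sampled translation.
    pe_rewards[i]: rewards of post-edits sampled while conditioning on
                   translation i (assumed grouping, for illustration only).
    """
    mt_adv = [w_mt * a for a in group_relative_advantages(mt_rewards)]
    pe_adv = [
        [w_pe * a for a in group_relative_advantages(group)]
        for group in pe_rewards
    ]
    return mt_adv, pe_adv


if __name__ == "__main__":
    # Toy rewards standing in for a quality metric; values are made up.
    mt_adv, pe_adv = weighted_two_stage_advantages(
        mt_rewards=[0.2, 0.4, 0.6],
        pe_rewards=[[0.1, 0.9], [0.5, 0.5], [0.3, 0.7]],
    )
    print(mt_adv)
    print(pe_adv)
```

In this sketch, each post-editing group is normalized against a baseline that depends on one specific sampled translation, which is one plausible reading of how conditioning on the current translation behavior could reduce noise in return estimation. Scaling the two stages by different weights changes the effective objective, which is consistent with the abstract's description of the resulting estimator as biased.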