While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains such as machine translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall short in this context: by evaluating candidates in isolation, they lack the comparative context needed to distinguish fine-grained linguistic nuances. To address this, we introduce the Group Quality Metric (GQM) paradigm and instantiate it via the Group Relative Reward Model (GRRM). Unlike traditional independent scorers, GRRM processes the entire candidate group jointly, leveraging comparative analysis to resolve relative quality at an adaptive granularity. Empirical evaluations confirm that GRRM achieves ranking accuracy competitive with all baselines. Building on this foundation, we integrate GRRM into the GRPO training loop to optimize the translation policy. Experimental results demonstrate that our framework not only improves general translation quality but also unlocks reasoning capabilities comparable to those of state-of-the-art reasoning models. We release our code, datasets, and model checkpoints at https://github.com/NJUNLP/GRRM.