Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators of machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and suffer from scoring-mechanism issues that lead to score overestimation. To address these challenges, we propose calibrating LRM thinking by training the models on synthetic, human-like thinking trajectories. Our experiments on the WMT24 Metrics benchmarks demonstrate that this approach reduces thinking budgets by roughly 35x while concurrently improving evaluation performance across LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation-point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.