Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have issues with scoring mechanisms that lead to overestimation. To address these, we propose calibrating LRM thinking by training the models on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach reduces thinking budgets by roughly 35x while concurrently improving evaluation performance across LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
