Extrinsic Evaluation of Machine Translation Metrics

Automatic machine translation (MT) metrics are widely used to distinguish the translation quality of MT systems across relatively large test sets (system-level evaluation). However, it is unclear whether automatic metrics can reliably distinguish good translations from bad ones at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when it is placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between a metric's prediction of a good/bad translation and the success/failure on the final task in the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable, mostly because of undefined ranges. We synthesise our analysis into recommendations for future MT metrics to produce labels rather than scores, for more informative interaction between machine translation and multilingual language understanding.
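
As a rough illustration of the correlation described above, the sketch below turns segment-level metric scores into good/bad labels and compares them with downstream task success. The abstract does not say which correlation statistic or thresholding scheme is used, so the Matthews correlation coefficient, the fixed threshold, the function names `binarize` and `mcc`, and the example numbers are all assumptions made for illustration, not the paper's method or data.

```python
import math


def binarize(scores, threshold):
    """Label a translation 'good' (True) when its metric score reaches the threshold.

    The threshold is a hypothetical choice; the abstract does not specify one.
    """
    return [score >= threshold for score in scores]


def mcc(predicted, actual):
    """Matthews correlation coefficient between two equal-length boolean lists.

    Assumed here as the agreement statistic between 'metric says good'
    and 'downstream task succeeded'. Returns 0.0 when it is undefined
    (any row or column of the confusion matrix is empty).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Made-up segment-level metric scores and task outcomes, for illustration only.
    metric_scores = [0.91, 0.35, 0.78, 0.42, 0.66, 0.20]
    task_succeeded = [True, True, False, False, True, False]

    metric_says_good = binarize(metric_scores, threshold=0.5)
    print("MCC between metric labels and task success:",
          mcc(metric_says_good, task_succeeded))
```

A value near zero would correspond to the "negligible correlation" the abstract reports, while values near 1 or -1 would indicate strong agreement or disagreement. Working with labels rather than raw scores also sidesteps the undefined score ranges the abstract mentions for neural metrics.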

