Automatic machine translation (MT) metrics are widely used to distinguish the translation quality of MT systems across relatively large test sets (system-level evaluation). However, it is unclear whether automatic metrics can reliably distinguish good translations from bad ones at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when it is placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation and the success/failure on the final task in the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable, mostly because of undefined ranges. We synthesise our analysis into recommendations for future MT metrics to produce labels rather than scores for more informative interaction between machine translation and multilingual language understanding.
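The correlation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which correlation statistic or decision threshold is used, so this example assumes a fixed threshold that turns each segment-level metric score into a good/bad label, and assumes the phi (Matthews) coefficient as the measure of agreement with downstream task success. The function names `binarise` and `phi_coefficient`, the threshold of 0.5, and the toy data are all hypothetical and are not results from the paper.

```python
from math import sqrt


def binarise(scores, threshold):
    """Label a translation as good (True) when its metric score meets the threshold.

    The threshold is an assumption; the abstract does not specify one.
    """
    return [score >= threshold for score in scores]


def phi_coefficient(predicted, actual):
    """Phi (Matthews) correlation between two equal-length boolean sequences.

    Returns 0.0 when the coefficient is undefined, i.e. when either
    sequence contains only one class.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # good translation, task succeeded
    tn = sum(1 for p, a in pairs if not p and not a)  # bad translation, task failed
    fp = sum(1 for p, a in pairs if p and not a)      # good translation, task failed
    fn = sum(1 for p, a in pairs if not p and a)      # bad translation, task succeeded

    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy data, invented for illustration only.
metric_scores = [0.91, 0.35, 0.78, 0.42, 0.66, 0.15]  # segment-level metric scores
task_success = [True, False, True, True, False, False]  # downstream task outcome

predicted_good = binarise(metric_scores, threshold=0.5)
print(round(phi_coefficient(predicted_good, task_success), 3))  # 0.333 on this toy data
```

A value near 0 would correspond to the negligible correlation reported in the abstract, while values near 1 or -1 would indicate strong agreement or disagreement between the metric's labels and the task outcomes. The choice of threshold matters in a sketch like this, which connects to the abstract's point that neural metric scores with undefined ranges are hard to interpret.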