We investigate the performance of machine translation (MT) evaluation metrics on adversarially synthesized texts to shed light on metric robustness. We experiment with word- and character-level attacks on three popular MT metrics: BERTScore, BLEURT, and COMET. Our human experiments validate that automatic metrics tend to overpenalize adversarially degraded translations. We also identify inconsistencies in BERTScore ratings: it judges the original sentence and the adversarially degraded one as similar, yet judges the degraded translation as notably worse than the original with respect to the reference. We identify patterns of brittleness that motivate the development of more robust metrics.
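To make the setup concrete, the following is a minimal sketch of the kind of perturbations and consistency check described above. It is illustrative only and not the attack procedure used in this work: the helper names (`swap_adjacent_chars`, `drop_word`, `is_inconsistent`), the threshold values, and the assumption that a metric is exposed as a function `metric(candidate, reference) -> float` with higher scores meaning better are all hypothetical.

```python
import random
from typing import Callable

# Hypothetical interface: metric(candidate, reference) -> score, higher is better.
Metric = Callable[[str, str], float]


def swap_adjacent_chars(text: str, seed: int = 0) -> str:
    """Character-level perturbation: swap one pair of adjacent, distinct letters."""
    rng = random.Random(seed)
    chars = list(text)
    # Positions where both neighbours are letters and differ, so the swap changes the text.
    candidates = [
        i for i in range(len(chars) - 1)
        if chars[i].isalpha() and chars[i + 1].isalpha() and chars[i] != chars[i + 1]
    ]
    if not candidates:
        return text  # nothing to perturb
    i = rng.choice(candidates)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def drop_word(text: str, seed: int = 0) -> str:
    """Word-level perturbation: delete one whitespace-separated token."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text  # keep at least one word
    del words[rng.randrange(len(words))]
    return " ".join(words)


def is_inconsistent(
    metric: Metric,
    original: str,
    degraded: str,
    reference: str,
    sim_threshold: float = 0.9,   # assumed cutoff for "judged similar"
    drop_threshold: float = 0.1,  # assumed cutoff for "notably worse"
) -> bool:
    """Flag the pattern described above: the metric rates the degraded sentence as
    similar to the original, yet scores it notably lower against the reference."""
    judged_similar = metric(degraded, original) >= sim_threshold
    score_drop = metric(original, reference) - metric(degraded, reference)
    return judged_similar and score_drop >= drop_threshold
```

In practice, `metric` would wrap a real scorer such as BERTScore, BLEURT, or COMET, and both thresholds would need to be calibrated to that metric's score range.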