Since its introduction, the COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, being a machine learning model, it also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected behaviours from three aspects: 1) technical: obsolete software versions and compute precision; 2) data: empty content, language mismatch, and translationese at test time, as well as distribution and domain biases in training; 3) usage and reporting: multi-reference support and model referencing in the literature. All of these problems imply that COMET scores are not comparable across papers or even across technical setups, and we put forward our perspective on fixing each issue. Furthermore, we release the SacreCOMET package, which can generate a signature for the software and model configuration, as well as an appropriate citation. The goal of this work is to help the community make more sound use of the COMET metric.
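As a hedged sketch of the compute-precision pitfall, the snippet below scores the same translation twice with the unbabel-comet package, once at default precision and once after casting to half precision. The calls to download_model, load_from_checkpoint, and predict follow the package's public API; invoking fp16 inference via .half() is our assumption for illustration, and the package may expose a different mechanism.

    # A minimal sketch of the precision pitfall (pip install unbabel-comet).
    from comet import download_model, load_from_checkpoint

    data = [{"src": "Der Hund bellt.",
             "mt": "The dog barks.",
             "ref": "The dog is barking."}]

    model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    score_fp32 = model.predict(data, batch_size=1, gpus=0).system_score

    # Assumption: casting with .half() forces fp16; fp16 typically needs a GPU.
    score_fp16 = model.half().predict(data, batch_size=1, gpus=1).system_score

    # The two scores may drift apart; without reporting the precision (and
    # software version), such numbers are not comparable across papers.
    print(f"fp32: {score_fp32:.4f}  fp16: {score_fp16:.4f}")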
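To make the signature idea concrete, the following is a minimal sketch of what a SacreCOMET-style signature could record, assembled from the factors named above (software versions, compute precision, and the exact scoring model). The helper comet_signature and the pipe-separated format are illustrative assumptions, not SacreCOMET's actual API.

    # Hypothetical sketch of a reproducibility signature; the helper and
    # format are assumptions, not the actual SacreCOMET interface.
    import platform
    from importlib.metadata import version

    def comet_signature(model_name: str, precision: str = "fp32") -> str:
        """Record the factors that can silently change COMET scores."""
        parts = [
            f"Python{platform.python_version()}",
            f"Comet{version('unbabel-comet')}",  # requires unbabel-comet installed
            precision,                           # e.g. "fp32" or "fp16"
            model_name,                          # the exact scoring checkpoint
        ]
        return "|".join(parts)

    # Example output: Python3.11.8|Comet2.2.2|fp32|Unbabel/wmt22-comet-da
    print(comet_signature("Unbabel/wmt22-comet-da"))

Reporting such a signature alongside a score pins down the configuration a number was produced under, which is what makes scores comparable across papers.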