Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT, estimate the quality of a translation hypothesis with a single sentence-level score. As a result, they offer little insight into translation errors (e.g., which errors occur and how severe they are). Generative large language models (LLMs), on the other hand, are driving the adoption of more fine-grained evaluation strategies that attempt to detail and categorize translation errors. In this work, we introduce xCOMET, an open-source learned metric designed to bridge the gap between these two approaches. xCOMET integrates sentence-level evaluation with error span detection, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thus enriching the quality assessment. We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.