Despite the recent success of automatic metrics for assessing translation quality, their application to machine-translated chats has been limited. Unlike more structured texts such as news, chat conversations are often unstructured, short, and heavily reliant on contextual information. This raises questions about the reliability of existing sentence-level metrics in this domain and about the role of context in assessing translation quality. Motivated by this, we conduct a meta-evaluation of existing sentence-level automatic metrics, primarily designed for structured domains such as news, on the task of assessing the quality of machine-translated chats. We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. We then investigate how incorporating conversational context into these metrics affects their performance. Our findings show that augmenting learned neural metrics with contextual information improves correlation with human judgments in the reference-free scenario and when evaluating translations in out-of-English settings. Finally, we propose a new evaluation metric, Context-MQM, that uses bilingual context with a large language model (LLM), and we further validate that adding context helps even for LLM-based evaluation metrics.
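As a rough illustration of what a meta-evaluation of this kind involves, the sketch below correlates metric scores with human judgments and shows one simple way to attach preceding chat turns to a segment before scoring. It is a minimal sketch under stated assumptions, not the procedure used in this work: the function names `pearson` and `with_context`, the choice of Pearson correlation, and the separator string are all hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists.

    In a meta-evaluation, xs would hold metric scores and ys the
    corresponding human quality judgments for the same segments.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def with_context(previous_turns: Sequence[str], current: str, sep: str = " | ") -> str:
    """Prepend preceding chat turns to the current segment.

    The separator and the plain concatenation format are assumptions
    made for illustration; a real metric may expect a different format.
    """
    return sep.join([*previous_turns, current])
```

A context-augmented input built with `with_context` would be passed to the metric in place of the bare segment, and `pearson` would then be computed once for the scores obtained with context and once for those obtained without, so the two correlations can be compared.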