Although proper handling of discourse contributes significantly to the quality of machine translation (MT), these improvements are not adequately measured by common translation quality metrics. Recent work in context-aware MT targets a small set of discourse phenomena during evaluation, but not in a fully systematic way. In this paper, we develop the Multilingual Discourse-Aware (MuDA) benchmark, a series of taggers that identify discourse phenomena in any given dataset and evaluate model performance on them. The choice of phenomena is inspired by a novel methodology for systematically identifying translations that require context. We confirm the difficulty of previously studied phenomena and uncover others that had not been addressed. We find that common context-aware MT models make only marginal improvements over context-agnostic models, which suggests that these models do not handle these ambiguities effectively. We release code and data for 14 language pairs to encourage the MT community to focus on accurately capturing discourse phenomena.