Several recent papers claim human parity in sentence-level Machine Translation (MT), especially for high-resource languages. In response, the MT community has partly shifted its focus to document-level translation. Translating documents requires a deeper understanding of the structure and meaning of text, which is often captured by discourse phenomena such as consistency, coherence, and cohesion. This shift, however, renders conventional sentence-level MT evaluation benchmarks inadequate for evaluating context-aware MT systems. This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced by Jiang et al. (2022). The new BWB annotation introduces four additional evaluation aspects, namely entity, terminology, coreference, and quotation, covering 15,095 entity mentions in both languages. Using these annotations, we systematically investigate the similarities and differences between the discourse structures of the source and target languages, and the challenges they pose to MT. We find that MT outputs differ fundamentally from human translations in their latent discourse structures, which offers a new perspective on the challenges and opportunities in document-level MT. We make our resource publicly available to spur future research in document-level MT and its generalization to translation tasks in other languages.