Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation

Machine Translation (MT) evaluation has moved beyond aggregate metrics toward specific linguistic phenomena. In the English-Chinese language pair, passive sentences are constructed and distributed differently in the two languages, and therefore need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to the human translation, together with a test set whose annotation has been manually verified. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems on the dataset, and four commercial models on the test set. The results show that, unlike human translators, models are influenced more by the voice of the source text than by the general voice usage of the source language, and therefore tend to maintain the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, leading to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. The datasets and annotation script will be shared upon request.
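
The abstract compares models by their "voice consistency" with human translators but does not define the computation. One plausible reading, offered here only as a minimal sketch, is the fraction of passive source sentences for which the model's translation uses the same voice as the human reference. The function name `voice_consistency` and the label values `"active"` and `"passive"` are assumptions for illustration; the paper's actual structure labels and scoring may differ.

```python
from typing import Sequence


def voice_consistency(human_labels: Sequence[str], model_labels: Sequence[str]) -> float:
    """Fraction of sentences whose model voice label equals the human voice label.

    Both arguments are hypothetical per-sentence voice labels (for example
    "active" or "passive") assigned to the human and model translations of
    the same source sentences, in the same order.
    """
    if len(human_labels) != len(model_labels):
        raise ValueError("label sequences must have the same length")
    if not human_labels:
        raise ValueError("label sequences must not be empty")
    # Count positions where the two translations agree on voice.
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)


# Made-up labels for four passive source sentences (not data from the paper).
human = ["active", "passive", "active", "active"]
model = ["passive", "passive", "passive", "active"]
print(voice_consistency(human, model))  # 0.5
```

Under this reading, a model that keeps the passive wherever the source is passive would score low whenever human translators prefer an active rendering, which matches the tendency the abstract describes.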
