Machine Translation (MT) evaluation has moved beyond aggregate metrics towards specific linguistic phenomena. In the English-Chinese language pair, passive sentences are constructed and distributed differently in the two languages, and therefore need special attention in MT. This paper presents a bidirectional, multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and automatically annotated with structure labels based on the human translations, together with a test set whose annotations were manually verified. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems on the full dataset and four commercial models on the test set. The results show that, unlike human translators, models are influenced more by the voice of the source text than by the general voice usage of the source language, and therefore tend to preserve the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, which leads to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. The datasets and annotation script will be shared upon request.