This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation, for which the evidence in the literature is incomplete. The study covers eight language pairs, different training corpus sizes, two architectures, and three types of annotation: dummy tags (with no linguistic information at all), part-of-speech tags, and morpho-syntactic description tags, which consist of part of speech and morphological features. These linguistic annotations are interleaved in the input or output streams as a single tag placed before each word. To measure the performance under each scenario, we use automatic evaluation metrics and perform automatic error classification. Our experiments show that, in general, source-language annotations are helpful and that morpho-syntactic descriptions outperform part-of-speech tags for some language pairs. By contrast, when words are annotated in the target language, part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics, even though the use of morpho-syntactic description tags improves the grammaticality of the output. We provide a detailed analysis of the reasons behind this result.
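As an illustration of the interleaving scheme described above, the following minimal Python sketch places a single tag before each word. The function names, the example sentence, and the tag strings (including the `|`-separated feature format for morpho-syntactic descriptions) are hypothetical choices made for this sketch and are not taken from the paper. The helper that strips tags assumes that, when the target side is annotated, tags would be removed from the output before scoring.

```python
from typing import List, Sequence


def interleave_tags(words: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Return a token stream in which each word is preceded by its tag."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    stream: List[str] = []
    for word, tag in zip(words, tags):
        stream.append(tag)   # the tag comes first ...
        stream.append(word)  # ... followed by the word it annotates
    return stream


def strip_tags(stream: Sequence[str]) -> List[str]:
    """Recover the words from an interleaved stream (tags sit at even indices)."""
    return list(stream[1::2])


# Hypothetical example sentence with the three annotation types.
words = ["the", "cats", "sleep"]
dummy = ["<TAG>"] * len(words)          # no linguistic information
pos = ["DET", "NOUN", "VERB"]           # part of speech only
msd = [                                 # part of speech plus morphological features
    "DET|Definite=Def",
    "NOUN|Number=Plur",
    "VERB|Number=Plur|Tense=Pres",
]

dummy_stream = interleave_tags(words, dummy)
pos_stream = interleave_tags(words, pos)
msd_stream = interleave_tags(words, msd)
```

In all three cases the annotated sequence is exactly twice as long as the original sentence, so the dummy-tag variant isolates the effect of the longer sequence from the effect of the linguistic information itself.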