Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation

This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation, for which there is incomplete evidence in the literature. The study covers eight language pairs, different training corpus sizes, two architectures, and three types of annotation: dummy tags (with no linguistic information at all), part-of-speech tags, and morpho-syntactic description tags, which consist of part of speech and morphological features. These linguistic annotations are interleaved in the input or output streams as a single tag placed before each word. To measure performance under each scenario, we use automatic evaluation metrics and perform automatic error classification. Our experiments show that, in general, source-language annotations are helpful and morpho-syntactic descriptions outperform part-of-speech tags for some language pairs. In contrast, when words are annotated in the target language, part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics, even though the use of morpho-syntactic description tags improves the grammaticality of the output. We provide a detailed analysis of the reasons behind this result.
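The interleaving scheme described in the abstract (one tag placed before each word) can be sketched as follows. This is only an illustration under stated assumptions: the function names `interleave_tags` and `strip_tags`, the angle-bracket tag format, and the example tag values are hypothetical and are not taken from the paper, and subword segmentation is ignored.

```python
from typing import List, Sequence


def interleave_tags(words: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Return a token stream with one tag placed before each word."""
    if len(words) != len(tags):
        raise ValueError("each word needs exactly one tag")
    stream: List[str] = []
    for word, tag in zip(words, tags):
        stream.append(tag)   # the annotation comes first
        stream.append(word)  # followed by the word it describes
    return stream


def strip_tags(stream: Sequence[str]) -> List[str]:
    """Recover the plain words from an interleaved stream (tags sit at even positions)."""
    return list(stream[1::2])


# Hypothetical example sentence with the three annotation types named in the abstract.
words = ["the", "cats", "sleep"]
dummy_tags = ["<TAG>"] * len(words)            # no linguistic information
pos_tags = ["<DET>", "<NOUN>", "<VERB>"]       # part of speech only
msd_tags = ["<DET>", "<NOUN|Number=Plur>",     # part of speech plus
            "<VERB|Number=Plur|Tense=Pres>"]   # morphological features

if __name__ == "__main__":
    for tags in (dummy_tags, pos_tags, msd_tags):
        print(" ".join(interleave_tags(words, tags)))
```

When the annotations are on the target side, a helper such as `strip_tags` would be needed to remove the generated tags before the output is scored against a plain-text reference.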


Related content

Part-of-speech tagging


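Part of speech is a basic grammatical property of a word, also commonly called word class. Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence and labeling the word with it; it is an important foundational problem in Chinese information processing. In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context [1].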
