Lexically-constrained NMT (LNMT) aims to incorporate user-provided terminology into translations. Despite its practical advantages, existing work has not evaluated LNMT models under challenging real-world conditions. In this paper, we focus on two important but under-studied issues in the current evaluation process of LNMT studies: the model needs to cope with lexical constraints that are "homographs" and with constraints that are "unseen" during training. To this end, we first design a homograph disambiguation module to differentiate the meanings of homographs. Moreover, we propose PLUMCOT, which integrates contextually rich information about unseen lexical constraints from pre-trained language models and strengthens the copy mechanism of the pointer network via direct supervision of a copying score. We also release HOLLY, an evaluation benchmark for assessing the ability of a model to cope with "homographic" and "unseen" lexical constraints. Experiments on HOLLY and the previous test setup show the effectiveness of our method, and the gains from PLUMCOT are particularly notable on "unseen" constraints. Our dataset is available at https://github.com/papago-lab/HOLLY-benchmark
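
To make the idea of supervising a copying score concrete, the following is a minimal sketch of a generic pointer-style copy gate with a binary cross-entropy term on the gate. It is not PLUMCOT's actual formulation, which the abstract does not specify; the function names (`mix_distributions`, `gate_loss`), the toy vocabulary, and the assumption that the label is 1 whenever the target token belongs to a lexical constraint are all hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_distributions(p_gen, attn, input_token_ids, gate):
    """Pointer-style mixture over the vocabulary.

    gate is the copying score in (0, 1): the weight given to copying
    input tokens (here assumed to include the constraint terms) versus
    generating from the vocabulary distribution p_gen.
    """
    mixed = [(1.0 - gate) * p for p in p_gen]
    for weight, tok in zip(attn, input_token_ids):
        mixed[tok] += gate * weight  # scatter attention mass onto token ids
    return mixed

def gate_loss(gate, should_copy, eps=1e-9):
    """Binary cross-entropy on the copying score (assumed supervision signal)."""
    label = 1.0 if should_copy else 0.0
    return -(label * math.log(gate + eps)
             + (1.0 - label) * math.log(1.0 - gate + eps))

# Toy example: vocabulary of 5 tokens, 3 input positions.
p_gen = softmax([0.1, 0.2, 0.3, 0.4, 0.5])
attn = softmax([2.0, 0.0, 0.0])   # attention concentrated on position 0
input_token_ids = [3, 1, 4]       # hypothetical ids; position 0 holds a constraint token
gate = 0.8

mixed = mix_distributions(p_gen, attn, input_token_ids, gate)
loss_if_copy = gate_loss(gate, should_copy=True)
loss_if_generate = gate_loss(gate, should_copy=False)
```

In this sketch, a high gate value is rewarded at decoding steps where the target token comes from a constraint and penalized elsewhere, which is one plausible reading of "direct supervision of a copying score".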