Domain-specific neural machine translation (NMT) systems (e.g., in educational applications) are socially significant, with the potential to help make information accessible to a diverse set of users in multilingual societies. It is desirable that such NMT systems be lexically constrained and draw from domain-specific dictionaries. Dictionaries may present multiple candidate translations for a source word/phrase due to the polysemous nature of words. The onus is then on the NMT model to choose the contextually most appropriate candidate. Prior work has largely ignored this problem and focused on the single-candidate constraint setting, wherein the target word or phrase is replaced by a single constraint. In this work we present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries. We achieve this by augmenting training data with multiple dictionary candidates, actively encouraging disambiguation during training by implicitly aligning the candidate constraints. We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering. We also present comparisons on standard benchmark test datasets. In comparison with existing approaches for lexically constrained and unconstrained NMT, DictDis achieves superior performance on constraint-copying and disambiguation-related measures across all domains, while also improving fluency by up to 2-3 BLEU points on some domains.