Lexical ambiguity is a challenging and pervasive problem in machine translation (\mt). We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural \mt. Our approach requires no sense annotation and no change to standard model architectures. Since actual document context is not available for the vast majority of \mt training data, we collect related sentences for each input to construct pseudo-documents. Salient words from pseudo-documents are then encoded as a prefix to each source sentence to condition the generation of the translation. To evaluate, we release \docmucow, a challenge set for translation disambiguation based on the English-German \mucow \cite{raganato-etal-2020-evaluation} augmented with document IDs. Extensive experiments show that our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines while reducing training costs.
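The pipeline summarized in the abstract (collect related sentences into a pseudo-document, pick salient words, prepend them to the source sentence) can be illustrated with a minimal sketch. The abstract does not say how related sentences are retrieved or how salience is scored, so everything concrete here is an assumption made for illustration: retrieval by word overlap, a smoothed TF-IDF score for salience, the `<sep>` separator token, the small stopword list, and all function names are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is", "was",
             "he", "she", "it", "at", "by", "after", "from"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_pseudo_document(source, corpus, k=2):
    """Assumed retrieval step: keep the k corpus sentences that share
    the most content words with the source sentence."""
    src_words = set(tokenize(source))
    scored = []
    for sent in corpus:
        if sent == source:
            continue
        overlap = len(src_words & set(tokenize(sent)))
        if overlap > 0:
            scored.append((overlap, sent))
    # Stable sort: ties keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:k]]


def salient_words(pseudo_doc, corpus, n=3):
    """Assumed salience step: rank pseudo-document words by a smoothed
    TF-IDF score, breaking ties alphabetically."""
    doc_freq = Counter()
    for sent in corpus:
        doc_freq.update(set(tokenize(sent)))
    term_freq = Counter(t for sent in pseudo_doc for t in tokenize(sent))
    scores = {
        word: count * math.log((1 + len(corpus)) / (1 + doc_freq[word]))
        for word, count in term_freq.items()
    }
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:n]


def add_context_prefix(source, corpus, sep="<sep>"):
    """Prepend salient pseudo-document words to the source sentence.
    Returns the source unchanged when no related sentence is found."""
    pseudo_doc = build_pseudo_document(source, corpus)
    words = salient_words(pseudo_doc, corpus)
    if not words:
        return source
    return f"{' '.join(words)} {sep} {source}"
```

Because the context enters only as a text prefix on the input, a standard sequence-to-sequence model can consume the result without architectural changes, which is the property the abstract emphasizes. With an ambiguous word such as "bank", the prefix would carry words from neighbouring river-related sentences rather than finance-related ones.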