Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets. However, it is still not widely adopted in real-world translation systems, mainly because large-scale, general-domain training data for document-level NMT is lacking. We examine the effectiveness of using Paracrawl for learning document-level translation. Paracrawl is a large-scale parallel corpus crawled from the Internet that covers a variety of domains. The official Paracrawl corpus was released as parallel sentences (extracted from parallel webpages), and previous work has therefore used it only for learning sentence-level translation. In this work, we extract parallel paragraphs from Paracrawl parallel webpages using automatic sentence alignments, and we use the extracted paragraphs as parallel documents for training document-level translation models. We show that document-level NMT models trained only on parallel paragraphs from Paracrawl can translate real documents from TED, News, and Europarl, outperforming sentence-level NMT models. We also perform a targeted pronoun evaluation and show that document-level models trained on Paracrawl data can help context-aware pronoun translation.
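The paragraph-extraction step can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes, purely for illustration, that each webpage pair comes with one-to-one sentence alignments given as `(source_index, target_index)` pairs, and that a "paragraph" is a maximal run of alignments that are consecutive on both sides. The function name `extract_parallel_paragraphs` and the `min_len` threshold are hypothetical.

```python
from typing import List, Tuple

Alignment = Tuple[int, int]               # (source sentence index, target sentence index)
Paragraph = Tuple[List[str], List[str]]   # (source sentences, target sentences)


def extract_parallel_paragraphs(
    src_sents: List[str],
    tgt_sents: List[str],
    alignments: List[Alignment],
    min_len: int = 2,  # hypothetical threshold: drop runs shorter than this
) -> List[Paragraph]:
    """Group 1-1 sentence alignments into runs that are contiguous on both sides."""
    runs: List[List[Alignment]] = []
    run: List[Alignment] = []
    for s, t in sorted(alignments):
        # A gap on either side ends the current run (assumed paragraph boundary).
        if run and (s != run[-1][0] + 1 or t != run[-1][1] + 1):
            if len(run) >= min_len:
                runs.append(run)
            run = []
        run.append((s, t))
    if len(run) >= min_len:
        runs.append(run)
    # Materialise each run as a pair of sentence lists.
    return [
        ([src_sents[s] for s, _ in r], [tgt_sents[t] for _, t in r])
        for r in runs
    ]


if __name__ == "__main__":
    src = ["a0", "a1", "a2", "a3", "a4"]
    tgt = ["b0", "b1", "b2", "b3"]
    # Source sentence 2 has no aligned target sentence, so it splits the page in two.
    for src_par, tgt_par in extract_parallel_paragraphs(
        src, tgt, [(0, 0), (1, 1), (3, 2), (4, 3)]
    ):
        print(src_par, "|||", tgt_par)
```

Each returned pair could then be treated as one parallel "document" when building training examples for a context-aware model. A real pipeline would also have to handle one-to-many alignments and noisy pages, which this sketch ignores.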