Retrieval-augmented generation (RAG) incorporates external information to enhance large language models (LLMs). In machine translation (MT), prior work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to improve models' translation ability. However, much of the world's knowledge is organized in unstructured documents and may not be fully paired across languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark for training and evaluating LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples collected via GPT-4o and human translators. In addition, documents in different languages are provided to supply the knowledge these samples require. Based on RAGtrans, we further propose a multi-task training method that teaches LLMs to use information from multilingual documents during translation. The method builds auxiliary training objectives from existing multilingual corpora and requires no additional labeling. Extensive experiments show that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET points.