Existing large language models (LLMs) for machine translation are typically fine-tuned on sentence-level translation instructions and perform satisfactorily at the sentence level. When applied to document-level translation, however, these models face a significant challenge, particularly on documents of more than 512 tokens: a sentence-level coverage issue, in which later sentences in the document are left untranslated. As a result, the document-level translation capability of LLMs fine-tuned on sentence-level translation instructions is significantly limited. We conjecture that the primary cause of this weak document-level performance is the absence of a document-to-document mapping ability. To address the issue, we propose fine-tuning LLMs on a combination of sentence-level and document-level translation instructions of varying lengths. The proposed translation mixed-instructions enable LLMs (Llama-2 7B and 13B) to maintain consistent translation performance from the sentence level up to documents of 2048 tokens. Extensive experimental results show that the proposed approach significantly enhances the document-level translation capabilities of LLMs on 10 language pairs, effectively mitigating the sentence-level coverage issue. Experiments on discourse phenomena further show that our document-level translation approach significantly improves translation quality in terms of both BLEU score and discourse coherence.
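The idea of mixing sentence-level and document-level instructions of varying lengths can be illustrated with a small data-construction sketch. The abstract does not specify the prompt template, the tokenizer, or the exact length buckets, so everything below is an assumption made for illustration: the function names are hypothetical, whitespace splitting stands in for the model tokenizer, the prompt wording is invented, and the budgets of 512, 1024, and 2048 tokens are chosen only because 512 and 2048 are the lengths the abstract mentions.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence), aligned one-to-one


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens approximate the model tokenizer's count.
    return len(text.split())


def chunk_document(doc: Sequence[Pair], max_tokens: int) -> List[List[Pair]]:
    """Greedily group consecutive sentence pairs by source-side token budget.

    A single sentence longer than max_tokens still forms its own chunk.
    """
    chunks: List[List[Pair]] = []
    current: List[Pair] = []
    used = 0
    for src, tgt in doc:
        n = count_tokens(src)
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append((src, tgt))
        used += n
    if current:
        chunks.append(current)
    return chunks


def make_instruction(pairs: Sequence[Pair], src_lang: str, tgt_lang: str) -> Dict[str, str]:
    # Hypothetical prompt template; the paper's actual wording is not given here.
    source = " ".join(s for s, _ in pairs)
    target = " ".join(t for _, t in pairs)
    prompt = f"Translate the following text from {src_lang} to {tgt_lang}.\n{source}"
    return {"prompt": prompt, "response": target}


def build_mixed_instructions(
    doc: Sequence[Pair],
    src_lang: str,
    tgt_lang: str,
    budgets: Sequence[int] = (512, 1024, 2048),  # assumed length buckets
) -> List[Dict[str, str]]:
    """Return sentence-level examples plus multi-sentence chunks per budget."""
    examples = [make_instruction([pair], src_lang, tgt_lang) for pair in doc]
    seen = {ex["prompt"] for ex in examples}
    for budget in budgets:
        for chunk in chunk_document(doc, budget):
            if len(chunk) < 2:
                continue  # already covered by the sentence-level examples
            example = make_instruction(chunk, src_lang, tgt_lang)
            if example["prompt"] not in seen:  # skip chunks repeated across budgets
                seen.add(example["prompt"])
                examples.append(example)
    return examples
```

Called on one aligned parallel document, `build_mixed_instructions` returns a single pool in which the same text appears as isolated sentences and as progressively longer contiguous spans. That is one plausible way to expose a model to document-to-document mappings during fine-tuning, not necessarily the construction used in the paper.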