In machine translation (MT), large language models (LLMs) have generally underperformed conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation, where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy that leverages LLM-augmented document-level data. First, we augment the data by converting summarization data into document-level parallel data using an LLM, and then filter it with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Second, we fine-tune in two stages: first on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
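The multi-metric filtering step can be illustrated with a minimal sketch. The abstract does not state how the three metrics are combined, so the conjunction of per-metric thresholds below is an assumption, as are the threshold values, the `Pair` record, and the function names. The sacreBLEU and COMET scores are assumed to be precomputed elsewhere, and the embeddings are assumed to come from a sentence encoder such as LaBSE.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Pair:
    """One candidate document pair with precomputed quality signals (hypothetical record)."""
    source: str
    target: str
    bleu: float                # precomputed sacreBLEU score on a 0-100 scale
    comet: float               # precomputed COMET score
    src_emb: Sequence[float]   # source embedding, e.g. from LaBSE
    tgt_emb: Sequence[float]   # target embedding, e.g. from LaBSE


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep(pair: Pair, min_bleu: float = 20.0, min_comet: float = 0.8,
         min_cos: float = 0.7) -> bool:
    """Keep a pair only if it clears every threshold (assumed combination rule)."""
    return (
        pair.bleu >= min_bleu
        and pair.comet >= min_comet
        and cosine(pair.src_emb, pair.tgt_emb) >= min_cos
    )


def filter_pairs(pairs: Sequence[Pair], **thresholds: float) -> List[Pair]:
    """Return the pairs that pass all metric thresholds, preserving order."""
    return [p for p in pairs if keep(p, **thresholds)]
```

Requiring every metric to pass is the most conservative choice; a weighted score or a majority vote would be equally plausible readings of "multiple metrics".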