Achieving consistently high-quality machine translation (MT) across diverse domains remains a significant challenge, primarily due to the limited and imbalanced parallel training data available across domains. While large language models (LLMs) have demonstrated impressive general understanding and generation abilities, their potential in multi-domain MT remains under-explored. We establish a comprehensive benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets that together cover 15 domains. Our evaluation of prominent LLMs reveals a discernible performance gap relative to traditional MT systems, highlighting domain overfitting and catastrophic forgetting after fine-tuning on domain-limited corpora. To mitigate this, we propose a domain Chain of Thought (CoT) fine-tuning technique that exploits the intrinsic multi-domain intelligence of LLMs to improve translation performance. The method prompts the LLM to first perceive domain information from the source text, which then serves as a hint to guide the translation process. Despite being trained on a small dataset spanning only four domains, our CoT fine-tuning approach achieves notable gains in translation accuracy and domain robustness over traditional fine-tuning, as evidenced by an average increase of 1.53 BLEU across over 20 distinct German$\rightarrow$English out-of-domain tests.
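The domain CoT idea described above can be sketched as a two-step prompt: first elicit the domain of the source sentence, then use that domain label as a hint for the translation step. The template wording and function name below are illustrative assumptions, not the paper's actual prompt.

```python
# Minimal sketch of a domain Chain-of-Thought translation prompt.
# The exact template is hypothetical; the paper's prompt may differ.

def build_domain_cot_prompt(source_text: str) -> str:
    """Build a two-step prompt: perceive the domain, then translate with it as a hint."""
    return (
        "Step 1: Identify the domain of the following German sentence "
        "(e.g. medical, legal, IT, news).\n"
        f"Source: {source_text}\n"
        "Step 2: Using the identified domain as a hint, translate the "
        "sentence into English.\n"
    )

# The resulting string would be sent to the LLM, whose response first names
# the domain and then produces the domain-aware translation.
prompt = build_domain_cot_prompt("Der Patient erhielt 5 mg des Wirkstoffs.")
print(prompt)
```

During fine-tuning, the target side would pair each such prompt with a reference that states the gold domain before the reference translation, so the model learns to emit the domain reasoning step itself.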