Fine-tuning large language models (LLMs) for machine translation has been shown to improve overall translation quality. However, it is unclear how fine-tuning affects desirable LLM behaviors that are absent from neural machine translation models, such as steerability, inherent document-level translation abilities, and the ability to produce less literal translations. We perform an extensive translation evaluation on the LLaMA and Falcon families of models, with model sizes ranging from 7 billion to 65 billion parameters. Our results show that while fine-tuning improves the general translation quality of LLMs, several abilities degrade. In particular, we observe a decline in the ability to perform formality steering, to produce technical translations through few-shot examples, and to perform document-level translation. On the other hand, we observe that the models produce less literal translations after fine-tuning on parallel data. We show that including monolingual data in the fine-tuning data preserves these abilities while simultaneously enhancing overall translation quality. Our findings emphasize the need for fine-tuning strategies that preserve the benefits of LLMs for machine translation.