In this paper, we propose a two-phase training approach in which pre-trained large language models are first continually pre-trained on parallel data and then supervised fine-tuned on a small amount of high-quality parallel data. To investigate the effectiveness of the proposed approach, we conducted continual pre-training of a 3.8B-parameter model with parallel data in eight different formats and evaluated the resulting models on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that when parallel data is used in continual pre-training, it is essential to alternate between source and target sentences. Moreover, translation accuracy improves only for the translation direction in which the order of source and target sentences matches between the continual pre-training data and inference. We further show that the LLM-based translation model is more robust when translating spoken language and achieves higher accuracy with less training data than supervised encoder-decoder models. Finally, the highest accuracy is achieved when the continual pre-training data consists of interleaved source and target sentences and tags are added to the source sentences.
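The best-performing data format above (interleaved source and target sentences, with tags on the source side) can be sketched as follows. This is a minimal illustrative assumption: the paper's exact tag strings and delimiters are not specified here, so the `<ja>` marker and newline layout below are hypothetical.

```python
# Hypothetical sketch of the "interleaved with tags" continual pre-training
# format: source and target sentences alternate line by line, and only the
# source sentence carries a language tag. Tag strings are assumed, not
# taken from the paper.

def build_example(pairs, src_tag="<ja>"):
    """Build one continual pre-training text from parallel sentence pairs.

    pairs: list of (source_sentence, target_sentence) tuples.
    Returns a single string with alternating tagged-source / target lines.
    """
    lines = []
    for src, tgt in pairs:
        lines.append(f"{src_tag} {src}")  # tag only the source sentence
        lines.append(tgt)                 # target sentence follows immediately
    return "\n".join(lines)

example = build_example([
    ("猫が好きです。", "I like cats."),
    ("今日は晴れです。", "It is sunny today."),
])
print(example)
```

Because the model sees source sentences only before their targets, this ordering matches Japanese-to-English inference; the reverse direction would require swapping the interleaving order, consistent with the direction-sensitivity result reported above.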