Large Language Models (LLMs) have gained significant attention in the field of natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, because large-scale corpora are difficult to acquire and the requisite computing resources are substantial. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mixture of Chinese, English, and parallel corpora to continue training the LLaMA2 model, aiming to align cross-language representations and facilitate knowledge transfer to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
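To illustrate the idea of a dynamic data sampler, the following is a minimal sketch, not the ChatFlow implementation. It assumes a linear schedule in which the probability of drawing a supervised example rises from 0 to 1 over training; the schedule shape, the function names `sft_probability` and `sample_batch`, and the toy data are all hypothetical choices made for illustration.

```python
import random


def sft_probability(step, total_steps, start=0.0, end=1.0):
    """Chance of drawing a supervised example at a given step.

    Assumption: a linear ramp from `start` to `end`; the actual
    schedule may differ.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * frac


def sample_batch(step, total_steps, pretrain_data, sft_data, batch_size, rng):
    """Draw a batch whose pre-training/supervised mix depends on the step."""
    p = sft_probability(step, total_steps)
    batch = []
    for _ in range(batch_size):
        # With probability p take a supervised example, else an unsupervised one.
        source = sft_data if rng.random() < p else pretrain_data
        batch.append(rng.choice(source))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    pretrain = ["raw text A", "raw text B"]            # hypothetical unsupervised corpus
    sft = ["instruction -> response 1", "instruction -> response 2"]  # hypothetical SFT set
    for step in (0, 500, 1000):
        print(step, sample_batch(step, 1000, pretrain, sft, 4, rng))
```

Under this assumed schedule, early batches contain only unsupervised text, late batches contain only supervised examples, and the mixture shifts gradually in between rather than switching at a hard stage boundary.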