We present Sailor, a family of open language models ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. The models are continually pre-trained from Qwen1.5, a language model well suited to multilingual use cases. Starting from Qwen1.5, Sailor models are trained on a further 200B to 400B tokens, primarily covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. Training leverages several techniques, including BPE dropout to improve model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize the data mixture. Experimental results on four typical tasks (commonsense reasoning, question answering, reading comprehension, and examination) indicate that Sailor models perform strongly across the corresponding benchmarks. Embracing the open-source spirit, we share our insights through this report to spark wider interest in developing large language models for multilingual use cases.
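To make the BPE dropout technique named above concrete, the following is a minimal sketch of the general idea: at each merge step, every applicable merge is independently skipped with probability p, so the same word can be segmented in different ways across training passes. It is not taken from the Sailor training code. The function name `bpe_dropout_encode`, the toy merge table, and the default dropout rate are all assumptions made for illustration.

```python
import random


def bpe_dropout_encode(word, merge_ranks, p=0.1, rng=None):
    """Segment `word` with BPE, skipping each candidate merge with probability p.

    merge_ranks maps a (left, right) symbol pair to its priority (lower = earlier).
    Hypothetical helper for illustration; not the Sailor implementation.
    """
    rng = rng if rng is not None else random.Random()
    tokens = list(word)  # start from single characters
    while len(tokens) > 1:
        # Collect adjacent pairs that are known merges and survive dropout.
        candidates = [
            (merge_ranks[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in merge_ranks and rng.random() >= p
        ]
        if not candidates:
            break  # nothing left to merge (or every merge was dropped)
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Toy merge table (assumed, for illustration only), in priority order.
toy_merges = [("s", "a"), ("i", "l"), ("sa", "il"), ("o", "r"), ("sail", "or")]
toy_ranks = {pair: rank for rank, pair in enumerate(toy_merges)}

deterministic = bpe_dropout_encode("sailor", toy_ranks, p=0.0)
fully_dropped = bpe_dropout_encode("sailor", toy_ranks, p=1.0)
sampled = bpe_dropout_encode("sailor", toy_ranks, p=0.3, rng=random.Random(0))
```

With `p=0.0` the sketch reduces to ordinary greedy BPE, and with `p=1.0` it falls back to character-level segmentation. Intermediate values yield a mix of segmentations, which is the source of the robustness the technique is intended to provide.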