In this paper, we present advancements in the pretraining of Mistral 7B, a large-scale language model, using a 32.6 GB dataset, equivalent to 1.1 billion tokens. We explore the impact of extending the context length by releasing models with context lengths of 4096 and 32768 tokens, and we further refine performance with an instruction-tuned model with a 16384-token context length, which we call Malaysian Mistral. Our experiments demonstrate the efficacy of continued pretraining and the influence of extended context lengths on Mistral 7B's language understanding capabilities. The instruction-tuned model also shows potential for capturing nuanced language intricacies. Furthermore, we benchmark Malaysian Mistral against prominent language models, including ChatGPT3.5 and Claude 2. Our results indicate that Malaysian Mistral achieves superior performance on the Tatabahasa (Malay grammar) test set, particularly when fine-tuned with instructions. All models are released at https://huggingface.co/collections/mesolitica/malaysian-mistral-7b-6528f2ec825f4bba46c1700c