We introduce Atlas-Chat, the first-ever collection of large language models developed specifically for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions under stringent quality control. Atlas-Chat-9B and Atlas-Chat-2B, fine-tuned on this dataset, exhibit a superior ability to follow Darija instructions and perform standard NLP tasks. Notably, on our newly introduced evaluation suite for Darija, which covers both discriminative and generative tasks, our models outperform state-of-the-art and Arabic-specialized LLMs such as LLaMA, Jais, and AceGPT, e.g., achieving a 13% performance boost over a larger 13B model on DarijaMMLU. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers a comprehensive design methodology for instruction-tuning low-resource language variants, which contemporary LLMs often neglect in favor of data-rich languages.
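Since the abstract states that all resources are publicly released, the following is a minimal sketch of how one might query an Atlas-Chat model through the Hugging Face transformers pipeline. The repository id MBZUAI-Paris/Atlas-Chat-9B, the generation settings, and the sample Darija prompt are illustrative assumptions, not details taken from the abstract; consult the project's public release for the actual identifiers.

```python
# A minimal sketch of querying an Atlas-Chat model via the Hugging Face
# transformers pipeline. The Hub id "MBZUAI-Paris/Atlas-Chat-9B" and the
# sample Darija prompt below are assumptions for illustration.
import torch
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="MBZUAI-Paris/Atlas-Chat-9B",  # assumed Hub id for the 9B model
    torch_dtype=torch.bfloat16,          # halves memory on supported GPUs
    device_map="auto",                   # place weights on available devices
)

# A Darija instruction: "Who are you?"
messages = [{"role": "user", "content": "شكون نتا؟"}]
outputs = chat(messages, max_new_tokens=256)

# For chat-style input, the pipeline returns the full conversation;
# the assistant's reply is the last message in the list.
print(outputs[0]["generated_text"][-1]["content"])
```

This requires a recent transformers release with chat-format pipeline support; the same call works for the 2B variant by swapping the model id.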