Large language models (LLMs) have greatly impacted the natural language processing (NLP) field, particularly for the English language. These models have demonstrated capabilities in understanding and generating human-like text. The success of language models largely depends on the availability of high-quality instruction datasets, which consist of detailed task descriptions and corresponding responses that are essential for training models to address a variety of prompts accurately. However, the availability and quality of these resources vary by language. While models perform well in English, they often struggle with languages like Arabic due to the scarcity of datasets for fine-tuning on Arabic-specific tasks. To address this issue, we introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content that covers several domains and instruction types. We assess this dataset by fine-tuning the open-source Gemma-7B model on several downstream tasks. Across multiple evaluations, the fine-tuned model achieves strong performance on several Arabic NLP benchmarks, underscoring the effectiveness of our dataset in elevating the capabilities of language models for Arabic. By providing resources that advance Arabic NLP development, our instruction dataset helps narrow the performance gap between English and Arabic language models. Building on this foundation, we developed GemmAr-7B-V1, a model specifically tuned to excel at a wide range of Arabic NLP tasks.