Large language models (LLMs) have greatly impacted the natural language processing (NLP) field, particularly for the English language. These models have demonstrated capabilities in understanding and generating human-like text. The success of language models largely depends on the availability of high-quality instruction datasets, which consist of detailed task descriptions and corresponding responses that are essential for training models to address a variety of prompts accurately. However, the availability and quality of these resources vary by language. While models perform well in English, they often struggle with languages like Arabic due to the scarcity of datasets for fine-tuning on Arabic-specific tasks. To address this issue, we introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content that covers several domains and instruction types. We assess this dataset by fine-tuning the open-source Gemma-7B model on several downstream tasks. Across multiple evaluations, the fine-tuned model achieves strong performance on several Arabic NLP benchmarks, underscoring the effectiveness of our dataset in elevating the capabilities of language models for Arabic. By providing resources that advance Arabic NLP development, our instruction dataset helps narrow the performance gap between English and Arabic language models. Building on this foundation, we developed GemmAr-7B-V1, a model specifically tuned to excel at a wide range of Arabic NLP tasks.