This paper develops a localized Large Language Model (LLM) for Arabic, a language with distinctive cultural characteristics that current mainstream models address inadequately, particularly with respect to cultural sensitivity and local values. To close this gap, the paper proposes a comprehensive solution comprising further pre-training on Arabic texts, Supervised Fine-Tuning (SFT) on native Arabic instructions paired with GPT-4 responses in Arabic, and Reinforcement Learning with AI Feedback (RLAIF) using a reward model attuned to local culture and values. The goal is to cultivate culturally aware and value-aligned Arabic LLMs that can accommodate the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations show that the resulting model, dubbed 'AceGPT', sets the state of the art for open Arabic LLMs across various benchmarks, including instruction-following benchmarks (Arabic Vicuna-80 and Arabic AlpacaEval), knowledge benchmarks (Arabic MMLU and EXAMs), and the newly introduced Arabic Cultural and Value Alignment benchmark. Notably, AceGPT outperforms Turbo on the popular Vicuna-80 benchmark when evaluated with GPT-4, although that benchmark is limited in scale. Code, data, and models are available at https://github.com/FreedomIntelligence/AceGPT.