This paper develops a localized Large Language Model (LLM) for Arabic, a language with unique cultural characteristics that current mainstream models address inadequately. In particular, significant concerns arise around cultural sensitivity and local values. To address them, the paper proposes a comprehensive solution comprising further pre-training on Arabic texts, Supervised Fine-Tuning (SFT) on native Arabic instructions paired with GPT-4 responses in Arabic, and Reinforcement Learning with AI Feedback (RLAIF) using a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs that can accommodate the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations show that the resulting model, dubbed 'AceGPT', sets the state of the art for open Arabic LLMs across various benchmarks, including instruction-following benchmarks (Arabic Vicuna-80 and Arabic AlpacaEval), knowledge benchmarks (Arabic MMLU and EXAMs), and the newly introduced Arabic Cultural and Value Alignment benchmark. Notably, AceGPT outperforms Turbo on the popular Vicuna-80 benchmark when evaluated with GPT-4, although the benchmark is limited in scale. Code, data, and models are available at https://github.com/FreedomIntelligence/AceGPT.