This paper explores the need for, and a methodology for, developing a localized Large Language Model (LLM) tailored to Arabic, a language with unique cultural characteristics that current mainstream models such as ChatGPT do not adequately address. Mainstream models also raise concerns about cultural sensitivity and alignment with local values. To this end, the paper outlines a packaged solution comprising further pre-training on Arabic texts, supervised fine-tuning (SFT) using native Arabic instructions and GPT-4 responses in Arabic, and reinforcement learning with AI feedback (RLAIF) using a reward model that is sensitive to local culture and values. The objective is to train culturally aware and value-aligned Arabic LLMs that can serve the diverse application-specific needs of Arabic-speaking communities. Extensive evaluations show that the resulting LLM, AceGPT, is the state-of-the-art open Arabic LLM on various benchmarks, including instruction-following benchmarks (Arabic Vicuna-80 and Arabic AlpacaEval), knowledge benchmarks (Arabic MMLU and EXAMs), and a newly proposed Arabic cultural and value alignment benchmark. Notably, AceGPT outperforms ChatGPT on the popular Vicuna-80 benchmark when evaluated with GPT-4, although that benchmark is limited in scale. Code, data, and models are available at https://github.com/FreedomIntelligence/AceGPT.