We present ALLaM: Arabic Large Language Model, a series of large language models built to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained with language alignment and knowledge transfer at scale in mind. Our autoregressive decoder-only models demonstrate how second-language acquisition, via vocabulary expansion and pretraining on a mixture of Arabic and English text, can steer a model towards a new language (Arabic) without catastrophic forgetting of the original language (English). Furthermore, we highlight the effectiveness of parallel/translated data in aiding knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to larger-scale models with lower-quality alignment. ALLaM achieves state-of-the-art performance on various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve in both Arabic and English over their base models.
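To make the vocabulary-expansion step concrete, the sketch below shows the generic mechanics in a HuggingFace-style stack: new Arabic subword tokens are added to an existing tokenizer and the embedding matrix is grown so the new ids get trainable rows while the original (English) embeddings stay intact. This is a minimal illustration under stated assumptions, not ALLaM's actual recipe; the base model name and the token list are placeholders.

```python
# Minimal sketch of vocabulary expansion for second-language acquisition.
# Assumptions: a decoder-only base model ("gpt2" here as a small stand-in)
# and a hypothetical Arabic subword inventory mined from an Arabic corpus.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "gpt2"  # placeholder for any autoregressive decoder-only base
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical new Arabic tokens (in practice, thousands of subwords
# learned from Arabic text would be merged into the vocabulary).
arabic_tokens = ["السلام", "عليكم", "اللغة", "العربية"]
num_added = tokenizer.add_tokens(arabic_tokens)

# Grow the embedding matrix so the new ids have trainable rows; the
# existing English embeddings are untouched, which is what allows
# continued pretraining on mixed Arabic/English data to add the new
# language without erasing the old one.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

After this step, continued pretraining on the Arabic/English mixture trains the new embedding rows alongside the rest of the network.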