Multimodal large language models (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension. However, due to a lack of high-quality multimodal resources in languages other than English, the success of MLLMs remains largely limited to English-based settings. This poses significant challenges for developing comparable models in other languages, even those with large speaker populations such as Arabic. To alleviate this challenge, we introduce a comprehensive family of Arabic MLLMs, dubbed \textit{Peacock}, with strong vision and language capabilities. Through qualitative and quantitative analysis, we demonstrate the solid performance of our models on various visual reasoning tasks and further show their emerging dialectal potential. Additionally, we introduce \textit{Henna}, a new benchmark specifically designed for assessing MLLMs on aspects related to Arabic culture, laying the first stone for culturally-aware Arabic MLLMs. The GitHub repository for the \textit{Peacock} project is available at \url{https://github.com/UBC-NLP/peacock}.