Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress remains predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models for languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance among Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions that incorporate both textual and visual elements. The model excels on two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.