Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress remains largely limited to English because high-quality multimodal resources are scarce in other languages. This scarcity impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce Dallah, an efficient Arabic multimodal assistant that builds on an advanced LLaMA-2-based language model to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance among Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions that incorporate both textual and visual elements. The model excels on two benchmarks: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.