Although image captioning has a vast array of applications, it has not reached its full potential in languages other than English. Arabic, for instance, although the native language of more than 400 million people, remains largely underrepresented in this area. This is due to the lack of both labeled data and powerful Arabic generative models. We alleviate this issue by presenting a novel vision-language model dedicated to Arabic, dubbed \textit{Violet}. Our model is based on a vision encoder and a Gemini text decoder that maintains generation fluency while allowing fusion between the vision and language components. To train our model, we introduce a new method for automatically acquiring data from available English datasets. We also manually prepare a new dataset for evaluation. \textit{Violet} performs sizeably better than our baselines on all of our evaluation datasets. For example, it reaches a CIDEr score of $61.2$ on our manually annotated dataset and achieves an improvement of $13$ points on Flickr8k.