Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder

Although image captioning has a vast array of applications, it has not reached its full potential in languages other than English. Arabic, for instance, is the native language of more than 400 million people yet remains largely underrepresented in this area. This is due to a lack of both labeled data and powerful Arabic generative models. We alleviate this issue by presenting a novel vision-language model dedicated to Arabic, dubbed Violet. Our model is based on a vision encoder and a Gemini text decoder that maintains generation fluency while allowing fusion between the vision and language components. To train our model, we introduce a new method for automatically acquiring data from available English datasets. We also manually prepare a new dataset for evaluation. Violet performs substantially better than our baselines on all of our evaluation datasets. For example, it reaches a CIDEr score of 61.2 on our manually annotated dataset and achieves an improvement of 13 points on Flickr8k.

