Most multilingual vision-and-language (V&L) research aims to achieve multilingual and multimodal capabilities within a single model. However, the scarcity of multilingual image captions has hindered progress. To overcome this obstacle, we propose ICU (Image Caption Understanding), which divides a V&L task into two stages: a V&L model first performs image captioning in English, and a multilingual language model (mLM) then takes the caption as alt text and performs cross-lingual language understanding. The burden of multilingual processing is thus lifted off the V&L model and placed on the mLM. Since multilingual text data is relatively more abundant and of higher quality, ICU can help V&L models overcome language barriers. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU achieves new state-of-the-art results for five languages and comparable results for the rest.
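The two-stage split described above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the names `ICUPipeline`, `toy_captioner`, and `toy_classifier` are hypothetical, the entailment-style labels are chosen only for the example, and the toy functions are placeholders standing in for a real English captioning model and a real mLM.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
#   Captioner:  raw image bytes -> English caption
#   Classifier: (English caption, text in any language) -> label
Captioner = Callable[[bytes], str]
Classifier = Callable[[str, str], str]


@dataclass
class ICUPipeline:
    """Two-stage sketch: English captioning, then text-only understanding."""
    captioner: Captioner
    classifier: Classifier

    def predict(self, image: bytes, text: str) -> str:
        # Stage 1: the V&L model only ever produces English.
        caption = self.captioner(image)
        # Stage 2: the mLM sees text only (caption + non-English input).
        return self.classifier(caption, text)


def toy_captioner(image: bytes) -> str:
    # Placeholder for a real captioning model; ignores the image.
    return "a dog runs on the beach"


# Tiny German-to-English lexicon, standing in for an mLM's
# cross-lingual knowledge. Purely illustrative.
TOY_LEXICON = {"hund": "dog", "strand": "beach", "katze": "cat"}


def toy_classifier(caption: str, text: str) -> str:
    # Placeholder for a real mLM: map known words to English and
    # check whether all of them appear in the caption.
    caption_words = set(caption.lower().split())
    mapped = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if mapped and all(w in caption_words for w in mapped):
        return "entailment"
    return "not_entailment"


pipeline = ICUPipeline(captioner=toy_captioner, classifier=toy_classifier)
label = pipeline.predict(b"", "Ein Hund ist am Strand")
```

Because the second stage depends only on text, the captioner and the classifier can be swapped independently. That independence is the design point the two-stage split is meant to illustrate.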