Most multilingual vision-and-language (V&L) research aims to achieve multilingual and multimodal capabilities within a single model. However, the scarcity of multilingual image captions has hindered progress. To overcome this obstacle, we propose ICU (Image Caption Understanding), which divides a V&L task into two stages: a V&L model first performs image captioning in English, and a multilingual language model (mLM) then takes the caption as alt text and performs cross-lingual language understanding. The burden of multilingual processing is thus lifted off the V&L model and placed on the mLM. Because multilingual text data is relatively more abundant and of higher quality than multilingual captioned images, ICU can help V&L models overcome language barriers. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU achieves new state-of-the-art results for five languages and comparable results for the rest.
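The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `TwoStagePipeline`, `toy_captioner`, `toy_classifier`, and `TOY_LEXICON` are hypothetical names introduced here, and both stages are trivial stand-ins for a real English captioning model and a real mLM. The only point is the data flow: the image is reduced to an English caption, and all multilingual handling happens in the second, text-only stage.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases for illustration; not names from the paper.
Captioner = Callable[[bytes], str]        # image bytes -> English caption
Classifier = Callable[[str, str], str]    # (English caption, statement) -> label


@dataclass
class TwoStagePipeline:
    """Stage 1 captions the image in English; stage 2 reasons over text only."""
    captioner: Captioner
    classifier: Classifier

    def predict(self, image: bytes, statement: str) -> str:
        caption = self.captioner(image)              # stage 1: V&L model
        return self.classifier(caption, statement)   # stage 2: multilingual LM


def toy_captioner(image: bytes) -> str:
    # Stand-in for an English captioning model; ignores its input.
    return "a dog runs on the beach"


# Stand-in for the cross-lingual knowledge a real mLM would provide.
TOY_LEXICON = {"hund": "dog", "katze": "cat", "strand": "beach"}


def toy_classifier(caption: str, statement: str) -> str:
    # Placeholder logic (word overlap after a lexicon lookup), not real inference.
    caption_words = set(caption.lower().split())
    mapped = {TOY_LEXICON.get(w, w) for w in statement.lower().split()}
    return "entailment" if mapped & caption_words else "neutral"


pipeline = TwoStagePipeline(toy_captioner, toy_classifier)
print(pipeline.predict(b"", "Hund Strand"))  # entailment
print(pipeline.predict(b"", "Katze"))        # neutral
```

Because the second stage sees only text, either component could in principle be swapped for a stronger model without retraining the other, which is the practical appeal of the decomposition.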