In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces PALO, a Large Multilingual Multimodal Model. PALO offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together are spoken by ~5B people (65% of the world population). We use a semi-automated translation pipeline to adapt the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, which ensures high linguistic fidelity while remaining scalable because it requires minimal manual effort. Incorporating diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. We train models at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines at each scale. We also propose the first multilingual multimodal benchmark, which future approaches can use to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO