In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called \textsc{Palo}. \textsc{Palo} offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span $\sim$5B people (65\% of the world population). Our approach uses semi-automated translation to adapt the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, thereby ensuring high linguistic fidelity while remaining scalable because it requires minimal manual effort. Incorporating diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. We train the resulting models at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for forthcoming approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.