The rapid development of large Vision-Language Models (VLMs) has led to impressive results on academic benchmarks, primarily in widely spoken languages. However, significant gaps remain in the ability of current VLMs to handle low-resource languages and varied cultural contexts, largely due to a lack of high-quality, diverse, and safety-vetted data. Consequently, these models often struggle to understand low-resource languages and cultural nuances, and to do so without producing toxic outputs. To address these limitations, we introduce Maya, an open-source Multimodal Multilingual model. Our contributions are threefold: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; 2) a thorough analysis of toxicity within the LLaVA dataset, followed by the creation of a novel toxicity-free version across eight languages; and 3) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code is available at https://github.com/nahidalam/maya.