Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and Western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M-instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the impact of the English data proportion, language popularity, and the number of multimodal training samples on overall performance. We fully open-source our data, code, and trained checkpoints to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.