Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and Western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M-sample instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the effects of the English data proportion, language popularity, and the number of multimodal training samples on overall performance. We fully open-source our data, code, and trained checkpoints to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.