We propose SPHINX-X, an extensive Multi-modal Large Language Model (MLLM) series developed upon SPHINX. To improve architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain, multi-modal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending its diversity and generality. By training over different base LLMs, including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
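To make the skip-token idea concrete, the following is a minimal sketch and not the released implementation. It assumes the image sits at the top-left of a square canvas that is cut into a grid of equally sized sub-images, and that a sub-image containing only padding is represented by a single placeholder token instead of its full token sequence. The names `SKIP`, `tokens_for_grid`, and all parameters are hypothetical, chosen only for illustration.

```python
from typing import List

# Hypothetical placeholder; the actual token name is an assumption.
SKIP = "<skip>"


def tokens_for_grid(height: int, width: int, tile: int, grid: int,
                    tokens_per_tile: int) -> List[str]:
    """Build a toy token sequence for an image padded onto a grid x grid canvas.

    Assumption: the image occupies the top-left (height x width) region of the
    canvas, and everything else is padding. A sub-image is "fully padded" when
    it does not overlap the image region at all.
    """
    tokens: List[str] = []
    for row in range(grid):
        for col in range(grid):
            top, left = row * tile, col * tile
            fully_padded = top >= height or left >= width
            if fully_padded:
                # One skip token stands in for the whole sub-image.
                tokens.append(SKIP)
            else:
                # Stand-ins for the visual tokens an encoder would produce.
                tokens.extend(
                    f"img_{row}_{col}_{k}" for k in range(tokens_per_tile)
                )
    return tokens


if __name__ == "__main__":
    # A tall image on a 2 x 2 canvas: the right-hand column is pure padding.
    seq = tokens_for_grid(height=448, width=224, tile=224, grid=2,
                          tokens_per_tile=4)
    print(len(seq), seq.count(SKIP))
```

In this toy setting the two padded sub-images cost one token each, so the sequence is shorter than the 16 tokens that encoding all four sub-images would require; the saving grows with the number of tokens per sub-image.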