SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao

From arXiv. Work in progress. Code and demos are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
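The weight mix strategy mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the exact mixing rule, so the code below assumes a simple linear interpolation between two checkpoints that share the same parameter names and shapes. The function name `mix_weights`, the coefficient `beta`, and the plain-dictionary checkpoint format are hypothetical and chosen only for illustration; they are not taken from the released code.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def mix_weights(real: Weights, synthetic: Weights, beta: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints (an assumed mixing rule).

    mixed = beta * real + (1 - beta) * synthetic, applied per parameter.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must share the same parameter names")

    mixed: Weights = {}
    for name, real_values in real.items():
        syn_values = synthetic[name]
        if len(real_values) != len(syn_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two weight vectors.
        mixed[name] = [
            beta * r + (1.0 - beta) * s for r, s in zip(real_values, syn_values)
        ]
    return mixed


if __name__ == "__main__":
    llm_real = {"layer.weight": [1.0, 3.0]}       # stand-in for the LLM tuned on real-world data
    llm_synthetic = {"layer.weight": [3.0, 1.0]}  # stand-in for the LLM tuned on synthetic data
    print(mix_weights(llm_real, llm_synthetic, beta=0.5))
```

With `beta=1.0` the result equals the real-data checkpoint and with `beta=0.0` the synthetic-data checkpoint; intermediate values trade off the two domains. In practice the same loop would run over the tensors of two model state dictionaries rather than over Python lists.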
