We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs, which are limited to text output, VisionLLM v2 significantly broadens the scope of application. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link", which serves as a medium connecting the MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support this diverse range of tasks, we carefully collected and curated training data from hundreds of public vision and vision-language tasks. In this way, our model can be jointly trained end-to-end on hundreds of vision-language tasks and generalize to these tasks through different user prompts with a single set of shared parameters, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.
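To make the "super link" idea concrete, the sketch below illustrates one plausible realization (our own illustration under assumptions, not the authors' released code): the MLLM emits a routing token whose hidden state is projected into a set of task queries that condition a task-specific decoder, so decoder gradients can flow back into the MLLM through the projection. All names here (SuperLink, task_queries, the routing-token convention) are hypothetical.

```python
import torch
import torch.nn as nn

class SuperLink(nn.Module):
    """Hypothetical bridge from MLLM hidden states to a task-specific decoder."""
    def __init__(self, hidden_dim: int, query_dim: int, num_queries: int):
        super().__init__()
        # Learnable queries that are conditioned on the routing token's hidden state.
        self.task_queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.proj = nn.Linear(hidden_dim, query_dim)

    def forward(self, routing_hidden: torch.Tensor) -> torch.Tensor:
        # routing_hidden: (batch, hidden_dim), hidden state of the emitted routing token.
        queries = self.task_queries.unsqueeze(0) + routing_hidden.unsqueeze(1)
        # Differentiable projection: decoder losses backpropagate into the MLLM.
        return self.proj(queries)  # (batch, num_queries, query_dim)

# Usage sketch: the decoder is chosen by which routing token the MLLM generates,
# e.g. a detection token for localization or a generation token for image synthesis.
hidden_dim, query_dim = 4096, 256
link = SuperLink(hidden_dim, query_dim, num_queries=100)
routing_hidden = torch.randn(2, hidden_dim)   # taken from the MLLM's last layer
task_queries = link(routing_hidden)           # fed to the selected task-specific decoder
```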