Multimodal Large Language Models (MLLMs), which enable Large Language Models (LLMs) to interpret images through visual instruction tuning, have recently achieved significant success. However, existing visual instruction tuning methods use only image-language instruction data to align the language and image modalities, and lack finer-grained cross-modal alignment. In this paper, we propose Position-enhanced Visual Instruction Tuning (PVIT), which extends MLLMs with an additional region-level vision encoder, giving the model a more detailed understanding of images. To align the vision modules with the LLM efficiently and at a fine-grained level, we design multiple data generation strategies for constructing an image-region-language instruction dataset. Finally, we present quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model. Code and data will be released at https://github.com/PVIT-official/PVIT.