Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model to perform diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to distinguish each task instruction more easily and also improve the model's learning efficiency for each task. After three-stage training, experimental results show that MiniGPT-v2 achieves strong performance on many visual question answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and code are available at https://minigpt-v2.github.io/