We present MultiModal-GPT, a vision and language model for multi-round dialogue with humans. MultiModal-GPT can follow a variety of human instructions, such as generating a detailed caption, counting objects of interest, and answering general questions from users. MultiModal-GPT is parameter-efficiently fine-tuned from OpenFlamingo, with Low-Rank Adaptation (LoRA) added to both the cross-attention and the self-attention parts of the language model. We first construct instruction templates from vision and language data for multi-modal instruction tuning, so that the model understands and follows human instructions. We find that the quality of the training data is vital for dialogue performance: even a small amount of data with short answers can lead the model to give short responses to any instruction. To further enhance MultiModal-GPT's ability to chat with humans, we also train it jointly on language-only instruction-following data. Joint training on language-only and visual-language instructions with the \emph{same} instruction template effectively improves dialogue performance. Various demos show MultiModal-GPT's ability to sustain continuous dialogue with humans. Code, dataset, and demo are available at https://github.com/open-mmlab/Multimodal-GPT
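
The idea of sharing one instruction template between language-only and visual-language samples can be illustrated with a minimal sketch. The template wording, the `<image>` placeholder, and the function name `build_prompt` are assumptions made for illustration only; the abstract does not specify the actual template text, and the released code may differ.

```python
# Hypothetical template wording; the abstract does not give the actual text.
HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that completes the request."
)
IMAGE_SLOT = "<image>"  # hypothetical placeholder for the visual input


def build_prompt(instruction: str, response: str = "", has_image: bool = False) -> str:
    """Format one dialogue turn with a single shared template.

    Language-only and visual-language samples differ only in whether
    the image line is present; every other field is identical.
    """
    lines = [HEADER, ""]
    if has_image:
        lines.append(f"### Image: {IMAGE_SLOT}")
    lines.append(f"### Instruction: {instruction}")
    # With an empty response the prompt ends at the response marker,
    # which is where the model would start generating.
    lines.append(f"### Response: {response}".rstrip())
    return "\n".join(lines)


vision_prompt = build_prompt("How many dogs are in the picture?", has_image=True)
text_prompt = build_prompt("Give three tips for staying healthy.")
```

Under these assumptions, removing the image line from a visual-language prompt yields exactly the language-only format, so both data sources present the same structure to the model during joint training.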