Visual Instruction Tuning

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4 generated visual instruction tuning data, our model, and our code base publicly available.


