Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Despite the promising performance achieved, these descriptions are derived from image annotations, which are often coarse-grained. Furthermore, because the language model never observes the full visual context, the generated instructions may even contradict the visual content. To address this challenge, we introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instructional data improves the performance of LLaVA-1.5, a state-of-the-art large multimodal model (LMM), across a wide spectrum of benchmarks by clear margins. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks, e.g., LLaVA$^w$ (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.