Although vision-language models (VLMs) show remarkable capabilities as versatile visual assistants, two substantial challenges persist in existing VLM frameworks: (1) a lack of task diversity in pretraining and visual instruction tuning, and (2) annotation errors and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address them, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, with each task accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are first finetuned on Vision-Flan and then further tuned on GPT-4 synthesized data. We find that this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning, and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses toward human-preferred formats; (2) a minimal quantity (e.g., 1,000 instances) of GPT-4 synthesized data can effectively align VLM responses with human preferences; (3) visual instruction tuning mainly helps large language models (LLMs) to understand visual features.
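The two-stage framework described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `finetune` callable, the `Example` record layout, and the `n_gpt4` and `seed` parameters are hypothetical names introduced here, and the default of 1,000 stage-two instances simply mirrors the "minimal quantity" mentioned in finding (2). A stand-in `log_finetune` function replaces a real training loop so that the control flow can be run as-is.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical record layout, assumed for illustration only.
Example = Dict[str, str]


def two_stage_tuning(
    model_state: List[str],
    finetune: Callable[[List[str], Sequence[Example]], List[str]],
    vision_flan: Sequence[Example],
    gpt4_synth: Sequence[Example],
    n_gpt4: int = 1000,
    seed: int = 0,
) -> List[str]:
    """Run stage 1 on human-annotated tasks, then stage 2 on a small GPT-4 subset."""
    # Stage 1: finetune on the diverse, human-annotated task collection.
    model_state = finetune(model_state, vision_flan)

    # Stage 2: continue from the stage-1 state on a small random sample
    # of GPT-4 synthesized data (capped at the size of the available pool).
    rng = random.Random(seed)
    k = min(n_gpt4, len(gpt4_synth))
    subset = rng.sample(list(gpt4_synth), k)
    return finetune(model_state, subset)


def log_finetune(state: List[str], data: Sequence[Example]) -> List[str]:
    # Stand-in for a real training loop: records how many examples a stage saw.
    return state + [f"tuned on {len(data)} examples"]


if __name__ == "__main__":
    # Toy placeholder data; the sizes are arbitrary, not the real dataset sizes.
    flan = [{"instruction": "Classify the scene.", "answer": "kitchen"}] * 5000
    synth = [{"instruction": "Describe the image.", "answer": "a placeholder"}] * 3000
    for line in two_stage_tuning([], log_finetune, flan, synth):
        print(line)
```

Passing the training routine in as a callable keeps the sketch independent of any particular model or library; in practice `finetune` would wrap whatever training loop the VLM uses, with stage two starting from the stage-one weights.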