Multi-modal large language models (MLLMs) are expected to support multi-turn queries that interleave image and text modalities in production. However, current MLLMs trained with visual-question-answering (VQA) datasets can suffer from degradation in language capability, as VQA datasets lack the diversity and complexity of the original text instruction datasets on which the underlying language model was trained. To address this degradation, we first collect a lightweight, 5k-sample VQA preference dataset in which answers are annotated by Gemini at a granular level across five quality metrics, and we investigate standard supervised fine-tuning, rejection sampling, Direct Preference Optimization (DPO), and SteerLM. Our findings indicate that with DPO we can surpass the instruction-following capability of the underlying language model, achieving a score of 6.73 on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99. This gain in textual instruction-following correlates with improved visual instruction performance (+4.9\% on MM-Vet, +6\% on LLaVA-Bench), with a minimal alignment tax on visual knowledge benchmarks relative to the previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment method with fine-grained annotations on a small dataset that restores and boosts the MLLM's language capability after visual instruction tuning.
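For reference, the DPO objective named above is the standard formulation from the literature rather than a detail specific to this work: given a prompt $x$ with preferred and dispreferred responses $y_w$ and $y_l$ drawn from a preference dataset $\mathcal{D}$, a policy $\pi_\theta$, and a frozen reference model $\pi_{\mathrm{ref}}$, DPO minimizes
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
\]
where $\sigma$ is the logistic function and $\beta$ controls the strength of the implicit penalty for deviating from the reference model.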