In production, multi-modal large language models (MLLMs) are expected to support multi-turn queries that interleave image and text modalities. However, current MLLMs trained on visual question answering (VQA) datasets can suffer degraded language capability, because VQA datasets lack the diversity and complexity of the original text instruction datasets on which the underlying language model was trained. To address this degradation, we first collect a lightweight (6k entries) VQA preference dataset in which answers are annotated by Gemini on 5 quality metrics in a granular fashion, and then investigate standard Supervised Fine-tuning, rejection sampling, Direct Preference Optimization (DPO), and SteerLM. Our findings indicate that with DPO we are able to surpass the instruction-following capabilities of the underlying language model, achieving a score of 6.73 on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99, despite the small data scale. This enhancement in textual instruction proficiency correlates with boosted visual instruction performance (+4.9% on MM-Vet, +6% on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks compared to the previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that reconciles the textual and visual performance of MLLMs, restoring and boosting language capability after visual instruction tuning.
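
As a rough illustration of how granular annotations can feed preference optimization, the sketch below turns per-metric scores into a (chosen, rejected) pair and evaluates the standard DPO objective for that pair. It is a minimal sketch under stated assumptions, not the pipeline used in this work: the helper names `make_preference_pair` and `dpo_loss`, the choice to aggregate the 5 metrics by their mean, and the value `beta=0.1` are all hypothetical and do not come from the abstract.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def make_preference_pair(answers):
    """Pick the best and worst answer for one question.

    `answers` is a list of (text, scores) tuples, where `scores` holds one
    rating per quality metric. Aggregating by the mean is an assumption
    made for illustration only.
    Returns (chosen_text, rejected_text), or None when all answers tie.
    """
    ranked = sorted(answers, key=lambda a: mean(a[1]))
    worst, best = ranked[0], ranked[-1]
    if mean(best[1]) == mean(worst[1]):
        return None  # no preference signal in this group
    return best[0], worst[0]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full answer under the
    trainable policy or the frozen reference model. `beta` is a
    hypothetical default, not a value reported in the abstract.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pair = make_preference_pair([
        ("detailed, grounded answer", [5, 4, 5, 4, 5]),
        ("short, partly wrong answer", [2, 3, 1, 2, 2]),
    ])
    print(pair)
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen answer's likelihood relative to the rejected one, measured against the reference.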