This paper explores preference distillation for large vision-language models (LVLMs), with the aim of improving their ability to generate helpful and faithful responses anchored in the visual context. We first build a vision-language feedback (VLFeedback) dataset using AI annotation. Specifically, responses are generated by models sampled from 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs in terms of helpfulness, visual faithfulness, and ethical considerations. The preference supervision is then distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively. Silkie also exhibits reduced hallucination, setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, leading to more comprehensive improvements than those obtained with human-annotated preference datasets.
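To make the distillation step concrete, the following is a minimal sketch of the standard per-pair DPO objective, written in plain Python. It is not the authors' training code. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the trainable policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    All names and the default beta are hypothetical, chosen for illustration.
    Each *_logp argument is the summed log-probability of a full response.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the chosen and rejected responses.
    z = beta * (chosen_gain - rejected_gain)

    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one, measured against the reference model.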