As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality, diverse data to align these models grows increasingly pressing. However, creating such data under human supervision is costly and time-intensive. In this paper, we investigate the efficacy of AI feedback for scaling supervision to align LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotation. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization (DPO) on VLFeedback. Silkie demonstrates strong performance on helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9\% and 9.5\% on perception and cognition tasks, respectively, reduces hallucination on MMHal-Bench, and exhibits enhanced resilience against red-teaming attacks. Furthermore, our analysis underscores a key advantage of AI feedback: it fosters preference diversity, which delivers more comprehensive improvements. Our dataset, training code, and models are available at https://vlf-silkie.github.io.
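For context, direct preference optimization fine-tunes the policy model $\pi_\theta$ directly on preference pairs against a frozen reference model $\pi_{\mathrm{ref}}$, bypassing an explicit reward model. Below is a minimal sketch of the standard DPO objective, where $(x, y_w, y_l)$ denotes an instruction paired with a preferred and a rejected response (here, drawn from VLFeedback) and $\beta$ is the usual KL-regularization strength; the notation is illustrative rather than taken from the paper:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],
\]
where $\sigma$ is the logistic function. Intuitively, the objective widens the likelihood margin of preferred over rejected responses relative to the reference model, with $\beta$ controlling how far the policy may drift from it.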