Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and the visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. However, these approaches may not effectively reflect the target LVLM's own preferences, making the curated preferences easily distinguishable. Our work addresses these challenges by proposing the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. For reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR enhances performance and reduces hallucinations across ten benchmarks and tasks, achieving a 7.62% improvement over existing methods. These empirical results are further supported by a rigorous theoretical analysis which, under mild assumptions, verifies the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR is compatible with different vision-language models and can incrementally improve performance through iterative fine-tuning. Our data and code are available at https://github.com/YiyangZhou/CSR.
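The preference-curation loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper functions (`self_reward`, `visual_score`) are hypothetical stand-ins for what CSR derives from the LVLM's own token probabilities and from an image-text relevance model, and the keyword-overlap "visual score" merely stands in for a learned similarity such as CLIP's.

```python
# Hypothetical sketch of calibrated self-rewarding preference curation.
# All function bodies are illustrative stubs, not CSR's actual scoring.

def self_reward(response: str) -> float:
    # Stand-in for the model's own confidence in its response
    # (e.g., a length-normalized log-probability in the real method).
    tokens = response.split()
    return len(set(tokens)) / max(len(tokens), 1)

def visual_score(response: str, image_keywords: set) -> float:
    # Stand-in for an image-text relevance score (e.g., CLIP similarity);
    # here approximated by keyword overlap for the sake of a runnable sketch.
    words = set(response.split())
    return len(words & image_keywords) / max(len(image_keywords), 1)

def calibrated_reward(response: str, image_keywords: set, lam: float = 0.5) -> float:
    # Calibrated reward: the self-reward is regularized by a visual term,
    # so responses grounded in the image outrank fluent-but-wrong ones.
    return (1 - lam) * self_reward(response) + lam * visual_score(response, image_keywords)

def curate_preference_pair(candidates: list, image_keywords: set):
    # Rank sampled candidates by calibrated reward; the best becomes the
    # "chosen" response and the worst the "rejected" one, yielding a
    # preference pair for optimization (e.g., DPO-style fine-tuning).
    ranked = sorted(candidates, key=lambda r: calibrated_reward(r, image_keywords))
    return ranked[-1], ranked[0]  # (chosen, rejected)

# Toy usage: two candidate captions for an image of a dog on a beach.
candidates = [
    "a dog runs on the beach near the water",
    "a cat sleeps on a sofa indoors",
]
chosen, rejected = curate_preference_pair(candidates, {"dog", "beach", "water"})
```

In the full method this pair-curation step is repeated over iterations: the model is fine-tuned on the curated pairs, then generates and scores fresh candidates, which is what enables the incremental improvement the abstract reports.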