Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advances, LVLMs often hallucinate: generated responses appear linguistically plausible but contradict the input image, indicating a misalignment between the image and text modalities. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and the visual representations are of high quality. Existing methods use additional models or human annotations to curate preference data and then improve modality alignment through preference optimization. Because this preference data originates outside the target LVLM, it may not reflect that model's own preferences, which makes the curated preferences easily distinguishable. We address these challenges with Calibrated Self-Rewarding (CSR), an approach that lets the model improve itself by iteratively generating candidate responses, evaluating a reward for each response, and curating preference data for fine-tuning. For reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results show that CSR improves performance and reduces hallucinations across ten benchmarks and tasks, improving over existing methods by 7.62%. A rigorous theoretical analysis under mild assumptions further supports these results, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. CSR is also compatible with different vision-language models and can improve performance incrementally through iterative fine-tuning. Our data and code are available at https://github.com/YiyangZhou/CSR.
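The loop described in the abstract (generate candidates, score each one step by step with a visually constrained reward, keep the best and worst as a preference pair) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the CSR code: the function names, the convex blend used for the step reward, the weight `lam = 0.9`, and the toy scorers that stand in for the model's own text score and for an image-text relevance score.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names below are hypothetical; none come from the CSR repository.
Response = List[str]  # a response split into steps (here: sentences)
ScoreFn = Callable[[str], float]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def step_reward(text_score: float, visual_score: float, lam: float) -> float:
    # Assumed form: a convex blend in which lam weights the visual term.
    return lam * visual_score + (1.0 - lam) * text_score


def response_reward(steps: Response, text_fn: ScoreFn,
                    visual_fn: ScoreFn, lam: float) -> float:
    # Step-wise scoring: accumulate the reward over every step of the response.
    return sum(step_reward(text_fn(s), visual_fn(s), lam) for s in steps)


def curate_pair(prompt: str, candidates: List[Response], text_fn: ScoreFn,
                visual_fn: ScoreFn, lam: float = 0.9) -> PreferencePair:
    # Rank candidates by reward; the highest becomes "chosen", the lowest "rejected".
    ranked = sorted(
        candidates,
        key=lambda steps: response_reward(steps, text_fn, visual_fn, lam),
    )
    return PreferencePair(prompt, " ".join(ranked[-1]), " ".join(ranked[0]))


# Toy stand-ins (assumptions): a real system would query the LVLM and an
# image-text relevance model instead of these hand-written scorers.
IMAGE_OBJECTS = {"dog", "frisbee"}           # objects actually in the image
VOCAB = {"dog", "frisbee", "cat", "ball"}    # objects the scorer recognizes


def toy_visual_score(sentence: str) -> float:
    # Fraction of recognized objects in the sentence that appear in the image.
    words = {w.strip(".,").lower() for w in sentence.split()}
    mentioned = words & VOCAB
    if not mentioned:
        return 0.0
    return len(mentioned & IMAGE_OBJECTS) / len(mentioned)


def toy_text_score(sentence: str) -> float:
    # Constant placeholder for the model's own score of the sentence.
    return 0.5


candidates = [
    ["A dog jumps.", "It catches a frisbee."],  # fully grounded
    ["A cat jumps.", "It catches a ball."],     # fully hallucinated
    ["A dog jumps.", "It catches a ball."],     # partly hallucinated
]
pair = curate_pair("Describe the image.", candidates,
                   toy_text_score, toy_visual_score)
print(pair.chosen)
print(pair.rejected)
```

In this toy setup the fully grounded response is selected as `chosen` and the fully hallucinated one as `rejected`. In an iterative scheme, pairs like this would be collected over many prompts and used for preference fine-tuning before the next round of generation.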