Large Multimodal Models (LMMs) are built across modalities, and misalignment between two modalities can result in "hallucination": generating textual outputs that are not grounded in the multimodal information in context. To address this multimodal misalignment, we adapt Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment: human annotators are asked to compare two responses and pinpoint the more hallucinated one, and the vision-language model is trained to maximize the simulated human rewards. We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information such as image captions and ground-truth multiple-choice options, which alleviates the reward hacking phenomenon in RLHF and further improves performance. We also enhance the GPT-4-generated training data (for vision instruction tuning) with previously available human-written image-text pairs to improve the general capabilities of our model. To evaluate the proposed approach in real-world scenarios, we develop a new evaluation benchmark, MMHAL-BENCH, with a special focus on penalizing hallucinations. As the first LMM trained with RLHF, our approach achieves remarkable improvement on the LLaVA-Bench dataset, reaching 94% of the performance level of the text-only GPT-4 (while the previous best methods achieve only the 87% level), and an improvement of 60% on MMHAL-BENCH over other baselines. We open-source our code, model, and data at https://llava-rlhf.github.io.
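To make the idea of a factually augmented reward model more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the reward model scores a text prompt in which the image caption and ground-truth options are placed ahead of the question and response, and that the reward model is trained with a standard Bradley-Terry pairwise loss on the annotators' comparisons; the abstract specifies neither detail. The names `build_reward_input` and `pairwise_preference_loss` are hypothetical.

```python
import math
from typing import Optional, Sequence


def build_reward_input(question: str, response: str,
                       caption: Optional[str] = None,
                       options: Optional[Sequence[str]] = None) -> str:
    """Assemble the text a (hypothetical) reward model would score.

    Assumption: factual augmentation is done by prepending the image
    caption and ground-truth options to the question/response pair.
    """
    parts = []
    if caption:
        parts.append(f"Image caption: {caption}")
    if options:
        parts.append("Ground-truth options: " + "; ".join(options))
    parts.append(f"Question: {question}")
    parts.append(f"Response: {response}")
    return "\n".join(parts)


def pairwise_preference_loss(reward_preferred: float,
                             reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected)).

    Assumption: the less hallucinated response is the preferred one.
    Written as a numerically stable softplus of the negative margin.
    """
    margin = reward_preferred - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    prompt = build_reward_input(
        question="What color is the bus?",
        response="The bus is red.",
        caption="A red double-decker bus on a city street.",
    )
    print(prompt)
    # The loss shrinks as the preferred response is scored higher.
    print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))
```

Under these assumptions, the extra factual context gives the reward model evidence to check a response against, which is the stated motivation for reducing reward hacking; the policy would then be optimized against the resulting scores.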