Detecting and Preventing Hallucinations in Large Vision Language Models

Instruction-tuned Large Vision Language Models (LVLMs) have advanced significantly in generalizing across a diverse set of multi-modal tasks, especially Visual Question Answering (VQA). However, generating detailed responses that are visually grounded remains challenging for these models. We find that even a current state-of-the-art LVLM (InstructBLIP) still produces responses in which a staggering 30 percent of the text is hallucinatory, in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a (M)ultimodal (Hal)lucination (Detect)ion Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained annotations on VQA examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. Unlike previous work that considers only object hallucination, we additionally annotate entity descriptions and relationships that are unfaithful. To demonstrate the potential of this dataset for hallucination prevention, we optimize InstructBLIP through our novel Fine-grained Direct Preference Optimization (FDPO). We also train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling. We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55% respectively. We also find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively, and that its scores correlate strongly with human-evaluated accuracy scores.
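As a rough illustration of best-of-n rejection sampling as mentioned in the abstract, the sketch below scores several candidate responses and keeps the highest-scoring one. It is a minimal sketch under stated assumptions, not the authors' implementation: `best_of_n`, `toy_reward`, and the sample captions are hypothetical, and the word-list scorer merely stands in for a learned multi-modal reward model, which would also condition on the image.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate response with the highest reward score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def toy_reward(response: str) -> float:
    """Hypothetical stand-in for a reward model.

    Penalizes each word from a fixed list of objects assumed to be absent
    from the image. A real reward model would score the response against
    the image itself.
    """
    assumed_absent = {"unicorn", "dragon"}
    words = response.lower().replace(".", "").split()
    return -float(sum(word in assumed_absent for word in words))


# In practice the candidates would be n samples drawn from the LVLM
# for the same image and prompt; here they are hard-coded.
samples = [
    "A dog chases a unicorn on the beach.",
    "A dog runs on the beach.",
    "A dragon and a unicorn watch a dog.",
]
best = best_of_n(samples, toy_reward)
print(best)
```

The selection step is independent of how the reward is computed, so the toy scorer could be replaced with any function that maps a response to a scalar score.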

