Instruction-tuned Large Vision Language Models (LVLMs) have advanced significantly in generalizing across a diverse set of multi-modal tasks, especially Visual Question Answering (VQA). However, generating detailed responses that are visually grounded remains challenging for these models. We find that even a current state-of-the-art LVLM, InstructBLIP, still produces responses in which a staggering 30 percent of the text is hallucinatory, in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a (M)ultimodal (Hal)lucination (Detect)ion Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained annotations on VQA examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. Unlike previous work that considers only object hallucination, we additionally annotate entity descriptions and relationships that are unfaithful. To demonstrate the potential of this dataset for hallucination prevention, we optimize InstructBLIP through our novel Fine-grained Direct Preference Optimization (FDPO). We also train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling. We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55%, respectively. We also find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57%, respectively, and that its scores correlate strongly with human-evaluated accuracy scores.
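To illustrate the best-of-n rejection sampling mentioned above, the following is a minimal sketch under stated assumptions rather than the procedure used in this work. It assumes n candidate responses have already been sampled and that a reward function maps each response to a scalar, with higher meaning less hallucinated. The names `best_of_n` and `toy_reward` are hypothetical, and the toy reward is a keyword-based stand-in for a trained multi-modal reward model, which would also condition on the image.

```python
from typing import Callable, List, Tuple


def best_of_n(
    candidates: List[str],
    reward_fn: Callable[[str], float],
) -> Tuple[str, float]:
    """Return the highest-reward candidate and its score (hypothetical helper)."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    # Score every sampled response, then keep the one the reward function prefers.
    scored = [(reward_fn(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_reward(response: str) -> float:
    """Toy stand-in for a reward model: penalize words assumed to be hallucinated."""
    assumed_hallucinations = {"unicorn", "dragon"}  # illustrative only
    words = [w.strip(".,") for w in response.lower().split()]
    return 0.0 - sum(w in assumed_hallucinations for w in words)


# Three sampled descriptions of the same (imagined) image.
samples = [
    "A dog chases a unicorn.",
    "A dog runs on grass.",
    "A dragon and a unicorn.",
]
best, score = best_of_n(samples, toy_reward)
```

In this sketch the selection step is independent of how the candidates are generated, so the same function could in principle be applied to responses sampled from different models.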