VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

Recent advancements in Multimodal Large Language Models (MLLMs) have extended their capabilities to video understanding. Yet these models are often plagued by "hallucinations": generated content that is irrelevant or nonsensical and deviates from the actual video context. This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types, intrinsic and extrinsic, and offers further subcategories for detailed analysis: object-relation, temporal, semantic detail, extrinsic factual, and extrinsic non-factual hallucinations. We adopt an adversarial binary VideoQA method for comprehensive evaluation, in which pairs of basic and hallucinated questions are crafted strategically. By evaluating eleven LVLMs on VideoHallucer, we reveal that i) the majority of current models exhibit significant issues with hallucinations; ii) while scaling datasets and parameters improves models' ability to detect basic visual cues and counterfactuals, it provides limited benefit for detecting extrinsic factual hallucinations; iii) existing models are more adept at detecting facts than identifying hallucinations. As a byproduct, these analyses inform the development of our self-PEP framework, which achieves an average 5.38% improvement in hallucination resistance across all model architectures.
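The paired, binary question format described above can be illustrated with a small scoring sketch. This is not the benchmark's official evaluation code. It assumes that each basic question expects the answer "yes", that each hallucinated question expects "no", and that a pair is credited only when both are answered correctly. The abstract states none of these rules, and the names `PairResult` and `score_pairs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Model answers for one basic/hallucinated question pair (hypothetical schema)."""
    basic_answer: str          # answer to the basic question, assumed to expect "yes"
    hallucinated_answer: str   # answer to the hallucinated question, assumed to expect "no"


def _normalize(answer: str) -> str:
    """Lower-case and trim an answer so that ' Yes ' and 'yes' compare equal."""
    return answer.strip().lower()


def score_pairs(results: list[PairResult]) -> dict[str, float]:
    """Return basic, hallucinated, and paired accuracy over a list of pairs.

    Assumption: a pair counts toward paired accuracy only if the basic
    question is answered "yes" AND the hallucinated question is answered "no".
    """
    if not results:
        raise ValueError("results must contain at least one pair")

    basic_hits = 0
    hallucinated_hits = 0
    paired_hits = 0
    for r in results:
        basic_ok = _normalize(r.basic_answer) == "yes"
        hallucinated_ok = _normalize(r.hallucinated_answer) == "no"
        basic_hits += basic_ok
        hallucinated_hits += hallucinated_ok
        paired_hits += basic_ok and hallucinated_ok

    n = len(results)
    return {
        "basic": basic_hits / n,
        "hallucinated": hallucinated_hits / n,
        "paired": paired_hits / n,
    }


if __name__ == "__main__":
    demo = [
        PairResult("yes", "no"),    # both correct
        PairResult("yes", "yes"),   # accepts the hallucinated premise
        PairResult("no", "no"),     # misses the basic fact
        PairResult(" Yes ", "No"),  # both correct after normalization
    ]
    print(score_pairs(demo))
```

Under these assumptions, a model that answers "yes" to everything scores perfectly on basic questions but zero on paired accuracy, which is why scoring the pair jointly is a natural fit for an adversarial setup.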

