Recent advancements in Multimodal Large Language Models (MLLMs) have extended their capabilities to video understanding. Yet these models are often plagued by "hallucinations": they generate irrelevant or nonsensical content that deviates from the actual video context. This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types, intrinsic and extrinsic, and divides them into subcategories for detailed analysis: object-relation, temporal, semantic detail, extrinsic factual, and extrinsic non-factual hallucinations. We adopt an adversarial binary VideoQA method for comprehensive evaluation, in which pairs of basic and hallucinated questions are strategically crafted. By evaluating eleven LVLMs on VideoHallucer, we reveal that i) the majority of current models exhibit significant issues with hallucinations; ii) while scaling datasets and parameters improves models' ability to detect basic visual cues and counterfactuals, it provides limited benefit for detecting extrinsic factual hallucinations; and iii) existing models are more adept at detecting facts than at identifying hallucinations. As a byproduct, these analyses inform the development of our self-PEP framework, which achieves an average improvement of 5.38% in hallucination resistance across all model architectures.
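As a rough illustration of how paired binary questions could be scored, the following sketch counts a pair as correct only when the model answers both the basic and the hallucinated question correctly. The names `QuestionPair` and `score_pairs`, the "yes"/"no" gold answers, and the three reported metrics are assumptions made for illustration; they are not taken from the benchmark's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QuestionPair:
    """One basic/hallucinated question pair with the model's answers (hypothetical structure)."""
    basic_pred: str          # model's answer to the basic question
    halluc_pred: str         # model's answer to the hallucinated question
    basic_gold: str = "yes"  # assumption: basic questions expect "yes"
    halluc_gold: str = "no"  # assumption: hallucinated questions expect "no"


def _correct(pred: str, gold: str) -> bool:
    # Compare answers case-insensitively, ignoring surrounding whitespace.
    return pred.strip().lower() == gold.strip().lower()


def score_pairs(pairs: List[QuestionPair]) -> Dict[str, float]:
    """Return accuracy on basic questions, hallucinated questions, and whole pairs."""
    if not pairs:
        raise ValueError("at least one question pair is required")
    n = len(pairs)
    basic_hits = sum(_correct(p.basic_pred, p.basic_gold) for p in pairs)
    halluc_hits = sum(_correct(p.halluc_pred, p.halluc_gold) for p in pairs)
    # A pair counts only if both of its questions are answered correctly.
    pair_hits = sum(
        _correct(p.basic_pred, p.basic_gold) and _correct(p.halluc_pred, p.halluc_gold)
        for p in pairs
    )
    return {
        "basic_acc": basic_hits / n,
        "hallucinated_acc": halluc_hits / n,
        "pair_acc": pair_hits / n,
    }


if __name__ == "__main__":
    demo = [
        QuestionPair("yes", "no"),   # both correct
        QuestionPair("Yes", "yes"),  # always answers "yes": fails the hallucinated question
        QuestionPair("no", "no"),    # always answers "no": fails the basic question
        QuestionPair("yes", "yes"),
    ]
    print(score_pairs(demo))
```

Under this scheme, a model that always answers "yes" or always answers "no" scores zero on paired accuracy, so the joint metric separates grounded answers from a fixed response bias.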