Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research on this problem has primarily been confined to image inputs, with limited exploration of video-based hallucinations. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address this, we introduce VidHal, a benchmark specifically designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across a wide range of common temporal aspects. A defining feature of our benchmark is the careful creation of captions that represent varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task that requires VLLMs to rank captions by their hallucinatory extent. We conduct extensive experiments on VidHal and comprehensively evaluate a broad selection of models. Our results uncover significant limitations in existing VLLMs, which remain prone to generating hallucinated content. Through our benchmark, we aim to inspire further research on 1) holistic understanding of VLLM capabilities, particularly regarding hallucination, and 2) development of advanced VLLMs to alleviate this problem.
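The caption ordering task above can be illustrated with a minimal sketch. Assuming the evaluation compares a model's predicted ranking of captions against a reference ordering from least to most hallucinatory, one natural fine-grained score is pairwise ordering agreement (a normalized Kendall-style metric). The function name, caption labels, and scoring scheme below are illustrative assumptions, not VidHal's actual protocol.

```python
def ordering_accuracy(predicted, reference):
    """Fraction of caption pairs whose relative order in `predicted`
    matches the `reference` ranking (least to most hallucinatory).

    Hypothetical metric for illustration; not VidHal's official score.
    """
    assert sorted(predicted) == sorted(reference), "rankings must cover the same captions"
    position = {caption: i for i, caption in enumerate(predicted)}
    n = len(reference)
    total_pairs = n * (n - 1) // 2
    agreements = 0
    for i in range(n):
        for j in range(i + 1, n):
            # In the reference, caption i is less hallucinatory than caption j;
            # check the model preserved that relative order.
            if position[reference[i]] < position[reference[j]]:
                agreements += 1
    return agreements / total_pairs

# Example with three captions of increasing hallucination severity.
reference = ["faithful", "minor_error", "severe_error"]
predicted = ["faithful", "severe_error", "minor_error"]  # one adjacent pair swapped
print(round(ordering_accuracy(predicted, reference), 3))  # → 0.667 (2 of 3 pairs agree)
```

A pairwise metric like this rewards partially correct rankings, which is what makes the ordering task finer-grained than a binary right/wrong caption-selection judgment.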