Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge .
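To illustrate how cluster-level probability masses can yield a reliability score, the following is a minimal sketch of a discrete Semantic Entropy computation. It assumes each sampled answer has already been assigned a semantic cluster label (by NLI-based or embedding-based clustering) and estimates each cluster's mass by its relative frequency. The function name `semantic_entropy` and the frequency-based estimate are illustrative assumptions, not the hedge-bench API or the exact formulation used in the paper; RadFlag and VASE are not shown.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over semantic clusters.

    cluster_ids: one cluster label per sampled answer (hypothetical input
    format; the clustering step itself is assumed to have run already).
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    # Assumption: cluster mass is approximated by relative frequency.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


# All samples agree on one hypothesis: no semantic uncertainty.
agree = semantic_entropy([0, 0, 0, 0])

# Samples split evenly across two hypotheses: entropy equals ln(2).
split = semantic_entropy([0, 1, 0, 1])
```

Under this sketch, a higher value means the sampled answers scatter across more semantic hypotheses, which is the signal an entropy-based detector would threshold to flag a possible hallucination.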