VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations

Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge .
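
As a rough illustration of how cluster-level probability masses can yield an entropy score, the sketch below computes a discrete Semantic Entropy from the cluster assignments of sampled answers. It assumes each sampled answer carries equal weight, so a cluster's mass is simply its share of the samples; the abstract does not specify the estimator, and the function name `semantic_entropy` and the example cluster ids are hypothetical, not part of the hedge-bench API.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) over empirical cluster masses.

    cluster_ids: one semantic-cluster label per sampled answer.
    Assumption: every sample contributes equal probability mass.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    total = len(cluster_ids)
    masses = [count / total for count in Counter(cluster_ids).values()]
    return sum(-p * math.log(p) for p in masses)


# Hypothetical cluster assignments for six sampled answers each.
consistent = [0, 0, 0, 0, 0, 0]   # all answers share one meaning
scattered = [0, 1, 2, 0, 3, 1]    # answers spread over four meanings

low = semantic_entropy(consistent)
high = semantic_entropy(scattered)
```

Under this assumption, answers that all fall into one cluster give zero entropy, while answers spread across several clusters give a higher score, which is the signal an entropy-based detector would treat as a sign of possible hallucination.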