OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination

Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach them to the wrong speaker, moment, or modality. These \emph{almost-true} errors evade standard video QA because the local evidence remains valid, so item-level scoring can reward a model for accepting both a supported claim and its near-counterfactual. We introduce a counterfactual event-binding protocol that constructs paired supported/counterfactual claims from the same audio-visual event evidence and evaluates them by strict-pair accuracy. We instantiate it as \bench, a benchmark for long-video Omni hallucination, with 3{,}600 single-claim QA items from 638 long-form videos averaging 24.16 minutes and covering 256.87 hours. Under this protocol, open-weight Omni models remain weak at pair-level binding: Qwen2.5-Omni-7B reaches 32.06\% and Qwen3-Omni-Instruct reaches 41.55\%, versus 76.54\% for a closed-source reference. To narrow this gap without updating the backbone, we propose \method, Modality-Perturbation Reliability Calibration, a frozen-backbone framework that selects audio-negative probes within video-level folds and fuses their response shifts with native audio-visual confidence into per-claim support estimates. \method lifts Qwen2.5-Omni-7B to 36.22\% and Qwen3-Omni-Instruct to 51.09\% on \bench, and improves target-adapted MCQ accuracy on OmniVideoBench ($+$2.20) and WorldSense ($+$1.51) with Qwen3-Omni-Instruct.
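
To illustrate why pair-level scoring is stricter than item-level scoring, the following is a minimal sketch, not the benchmark's official scorer. It assumes that each item carries a hypothetical `pair_id` linking a supported claim to its counterfactual, a boolean `label` (whether the claim is supported), and a boolean `pred` (the model's verdict); these field names and the toy records are illustrative assumptions. A pair is counted as correct only if every item in it is answered correctly.

```python
from collections import defaultdict


def item_accuracy(records):
    """Fraction of individual claims judged correctly."""
    return sum(r["pred"] == r["label"] for r in records) / len(records)


def strict_pair_accuracy(records):
    """Fraction of pairs in which every claim is judged correctly."""
    pairs = defaultdict(list)
    for r in records:
        pairs[r["pair_id"]].append(r["pred"] == r["label"])
    return sum(all(hits) for hits in pairs.values()) / len(pairs)


# Toy data (hypothetical): the model accepts both claims of pair "e1",
# so it is right on the supported claim but wrong on the counterfactual.
records = [
    {"pair_id": "e1", "label": True, "pred": True},
    {"pair_id": "e1", "label": False, "pred": True},
    {"pair_id": "e2", "label": True, "pred": True},
    {"pair_id": "e2", "label": False, "pred": False},
]

print(item_accuracy(records))         # 3 of 4 items correct
print(strict_pair_accuracy(records))  # 1 of 2 pairs fully correct
```

In this toy case a model that accepts every claim still gets three of four items right, while only one of the two pairs counts under the strict criterion.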
