Video language models (Video-LLMs) are prone to hallucinations, often generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing decoding methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for mitigating hallucination patterns. However, such perturbations neither control the visual cues that drive hallucination nor align well with model-specific weaknesses. We propose Model-aware Counterfactual Data based Contrastive Decoding (MACD), a new inference strategy that combines model-guided counterfactual construction with decoding. Our approach uses the Video-LLM's own feedback to identify the object regions most responsible for hallucination, generating targeted counterfactual inputs at the object level rather than through arbitrary frame or temporal modifications. This model-aware counterfactual data is then integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test, and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including the Qwen and InternVL families. The method is especially effective in challenging scenarios involving small, occluded, or co-occurring objects. Our code and data will be publicly released.
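To make the decoding step concrete, the sketch below shows a standard contrastive-decoding token selection of the kind the abstract builds on: logits from the original video are contrasted against logits from a counterfactual (object-perturbed) input. This is a minimal illustrative sketch, not the paper's implementation; the function name, the hyperparameters `alpha` and `beta`, and the adaptive-plausibility cutoff are assumptions based on common CD formulations.

```python
import numpy as np

def contrastive_decode_step(logits_orig, logits_cf, alpha=1.0, beta=0.1):
    """One greedy step of contrastive decoding (illustrative sketch).

    logits_orig: next-token logits given the original video input.
    logits_cf:   next-token logits given the counterfactual input
                 (e.g. with a hallucination-driving object region removed).
    alpha, beta: illustrative hyperparameters, not values from the paper.
    """
    # Contrast: boost tokens supported by the real visual evidence and
    # suppress tokens the model would also emit for the perturbed input.
    contrast = (1 + alpha) * logits_orig - alpha * logits_cf

    # Adaptive plausibility constraint: only consider tokens whose original
    # probability is within a beta-fraction of the most likely token.
    probs = np.exp(logits_orig - logits_orig.max())
    probs /= probs.sum()
    mask = probs >= beta * probs.max()
    contrast = np.where(mask, contrast, -np.inf)

    return int(np.argmax(contrast))  # index of the selected next token
```

In this formulation, a token that remains likely even when the suspect object region is perturbed contributes little visual evidence and is down-weighted, which is the mechanism MACD's model-aware counterfactuals are designed to sharpen.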