AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. Omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, but they struggle to perceive cross-modal conflicts, a fundamental human capability that is critical for trustworthy AI. Existing benchmarks focus predominantly on aligned events or deepfake detection, leaving inconsistency perception in long-form video largely unevaluated. AVID addresses this gap with two contributions: (1) a scalable construction pipeline comprising a temporal segmentation stage that classifies video content into Active Speaker, Voiceover, and Scenic categories, an agent-driven strategy planner that selects semantically appropriate inconsistency categories, and five specialized injectors that introduce diverse audio-visual conflicts; and (2) a dataset of 11.2K long videos (avg. 235.5 s) with 39.4K annotated inconsistency events and 78.7K segment clips, supporting evaluation of detection, temporal grounding, classification, and reasoning across 8 fine-grained inconsistency categories. Comprehensive evaluations of state-of-the-art omni-modal models reveal significant limitations in temporal grounding and reasoning. Our fine-tuned baseline, AVID-Qwen, improves substantially over its base model (2.8× higher BLEU-4 in segment reasoning) and surpasses all compared models in temporal grounding (mIoU: 36.1% vs. 26.2%) and holistic understanding (SODA-m: 7.47 vs. 6.15), supporting AVID as an effective testbed for advancing trustworthy omni-modal AI systems.
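
The temporal grounding results above are reported as mIoU. As a rough illustration of how such a score is commonly computed, the sketch below averages the temporal intersection-over-union between predicted and ground-truth time spans. It assumes exactly one predicted span per annotated event, which the abstract does not specify; the function names `temporal_iou` and `mean_iou` are hypothetical and are not taken from the AVID codebase.

```python
from typing import List, Tuple

# A time span in seconds: (start, end). Illustrative type alias only.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time spans on the video timeline."""
    # Overlap is clamped at zero for spans that do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Interval], gts: List[Interval]) -> float:
    """Average IoU over paired spans (assumption: one prediction per event)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    predictions = [(10.0, 20.0), (0.0, 5.0)]
    ground_truth = [(15.0, 25.0), (10.0, 15.0)]
    print(f"mIoU = {mean_iou(predictions, ground_truth):.3f}")
```

In this toy case the first pair overlaps for 5 s out of a 15 s union and the second pair does not overlap at all. A benchmark's actual protocol may differ, for example in how missing or multiple predictions are matched to events.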
