In this paper, we propose XGC-AVis, a multi-agent framework that enhances the audio-video temporal alignment capabilities of multimodal large language models (MLLMs) and improves the efficiency of key video segment retrieval through four stages: perception, planning, execution, and reflection. We further introduce XGC-AVQuiz, the first benchmark aimed at comprehensively assessing MLLMs' understanding capabilities in both real-world and AI-generated scenarios. XGC-AVQuiz consists of 2,685 question-answer pairs across 20 tasks, with two key innovations: 1) AIGC Scenario Expansion: The benchmark includes 2,232 videos, comprising 1,102 professionally generated content (PGC), 753 user-generated content (UGC), and 377 AI-generated content (AIGC) videos, covering 10 major domains and 53 fine-grained categories. 2) Quality Perception Dimension: Beyond conventional tasks such as recognition, localization, and reasoning, we introduce a novel quality perception dimension, which requires MLLMs to integrate low-level sensory capabilities with high-level semantic understanding to assess audio-visual quality, synchronization, and coherence. Experimental results on XGC-AVQuiz demonstrate that current MLLMs struggle with quality perception and temporal alignment tasks. XGC-AVis improves these capabilities without requiring additional training, as validated on two benchmarks.
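To make the four-stage agent loop concrete, the sketch below outlines one plausible control flow for perception, planning, execution, and reflection. All class, function, and variable names here (`State`, `perceive`, `make_plan`, `execute`, `reflect`, `run_pipeline`) are hypothetical illustrations of the stages named in the abstract, not XGC-AVis's actual implementation.

```python
# Minimal sketch of a perception -> planning -> execution -> reflection
# loop for audio-video question answering. Every name here is a
# hypothetical illustration, not the paper's actual API.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    note: str     # what the agent observed in this segment


@dataclass
class State:
    question: str
    video_path: str
    plan: List[str] = field(default_factory=list)
    evidence: List[Segment] = field(default_factory=list)
    answer: Optional[str] = None


def perceive(state: State) -> List[str]:
    """Stage 1: produce coarse audio/visual observations (stubbed)."""
    return [f"observations for {state.video_path}"]


def make_plan(state: State, percepts: List[str]) -> List[str]:
    """Stage 2: decompose the question into retrieval sub-tasks (stubbed)."""
    return [f"locate segment relevant to: {state.question}"]


def execute(state: State) -> State:
    """Stage 3: run each sub-task, collecting candidate key segments."""
    for task in state.plan:
        # Placeholder retrieval; a real system would localize segments here.
        state.evidence.append(Segment(0.0, 5.0, task))
    state.answer = "draft answer grounded in retrieved evidence"
    return state


def reflect(state: State) -> bool:
    """Stage 4: self-check audio-video temporal consistency of the answer."""
    return bool(state.answer and state.evidence)  # placeholder acceptance test


def run_pipeline(question: str, video_path: str, max_rounds: int = 3) -> State:
    """Iterate the four stages until reflection accepts the answer."""
    state = State(question=question, video_path=video_path)
    for _ in range(max_rounds):
        percepts = perceive(state)
        state.plan = make_plan(state, percepts)
        state = execute(state)
        if reflect(state):  # stop once the self-check passes
            break
    return state


if __name__ == "__main__":
    result = run_pipeline("When does the speech begin?", "demo.mp4")
    print(result.answer, result.evidence)
```

The key design choice the sketch highlights is that reflection gates termination: failed self-checks feed back into another perception-planning round rather than returning an unverified answer.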