With the rapid advancement of video generation techniques, evaluating and auditing generated videos has become increasingly important. Existing approaches typically offer only coarse video quality scores and do not localize or categorize specific artifacts. In this work, we introduce a comprehensive evaluation protocol focusing on three key aspects affecting human perception: Appearance, Motion, and Camera. We define these axes through a taxonomy of 10 prevalent artifact categories that reflect common failures of video generation models. To enable robust artifact detection and categorization, we introduce GenVID, a large-scale dataset of 80k videos generated by various state-of-the-art video generation models, each carefully annotated for the defined artifact categories. Leveraging GenVID, we develop DVAR, a Dense Video Artifact Recognition framework for fine-grained identification and classification of generative artifacts. Extensive experiments show that our approach significantly improves artifact detection accuracy and enables effective filtering of low-quality content.