The advent and proliferation of large multi-modal models (LMMs) have introduced a new paradigm to video-related computer vision fields, including training and inference methods based on visual question answering (VQA). These methods enable models to handle multiple downstream tasks robustly. Video Quality Assessment (VQA), a classic field in low-level visual quality evaluation, originally focused on quantitative video quality scoring. However, driven by advances in LMMs, it is now evolving toward more comprehensive visual quality understanding tasks. Visual question answering has recently brought significant improvements to low-level visual evaluation in the image domain. However, related work is almost nonexistent in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset, the first visual question answering instruction dataset that focuses entirely on video quality assessment, and based on it we propose the VQA2 series models. The VQA2 Instruction Dataset consists of three stages, covers various video types, and contains 157,735 instruction question-answer pairs, including both manually annotated and synthetic data. We conduct extensive experiments on both video quality scoring and video quality understanding tasks. The results demonstrate that the VQA2 series models achieve state-of-the-art (SOTA) performance on quality scoring tasks and surpass the renowned GPT-4o in visual quality question answering. Additionally, our final model, VQA2-Assistant, performs well on both scoring and question-answering tasks, validating its versatility.