The advent and proliferation of large multi-modal models (LMMs) have introduced new paradigms to computer vision, transforming various tasks into a unified visual question answering framework. Video Quality Assessment (VQA), a classic field in low-level visual perception, initially focused on quantitative video quality scoring. However, driven by advances in LMMs, it is now progressing toward more holistic visual quality understanding tasks. Recent studies in the image domain have demonstrated that Visual Question Answering can markedly enhance low-level visual quality evaluation. Nevertheless, this direction remains unexplored in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset, the first visual question answering instruction dataset focused on video quality assessment. The dataset consists of 3 subsets, covers various video types, and contains 157,755 instruction question-answer pairs. Building on this foundation, we present the VQA2 series models, which interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos. We conduct extensive experiments on video quality scoring and understanding tasks, and the results demonstrate that the VQA2 series models achieve excellent performance on both. Notably, our final model, VQA2-Assistant, exceeds the renowned GPT-4o on visual quality understanding tasks while remaining strongly competitive on quality scoring tasks. Our work provides a foundation and a feasible approach for integrating low-level video quality assessment and understanding with LMMs.
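The abstract does not specify how visual and motion tokens are combined; the sketch below is one minimal, hypothetical interpretation of "interleaving" — alternating per-frame visual token chunks with per-frame motion token chunks along the sequence dimension before feeding them to the language model. All names (`interleave_tokens`, the chunk layout) are assumptions for illustration, not the paper's implementation.

```python
def interleave_tokens(visual_chunks, motion_chunks):
    """Alternate per-frame visual and motion token chunks into one sequence.

    visual_chunks: list over frames, each a list of visual tokens
    motion_chunks: list over frames, each a list of motion tokens
    Returns a flat token sequence: [v_1..., m_1..., v_2..., m_2..., ...]
    (Hypothetical layout -- the actual VQA2 ordering is not specified here.)
    """
    sequence = []
    for visual, motion in zip(visual_chunks, motion_chunks):
        sequence.extend(visual)   # spatial quality cues for this frame
        sequence.extend(motion)   # temporal/motion cues for this frame
    return sequence


# Toy example with 2 frames, 2 visual tokens and 1 motion token per frame.
visual = [["v1a", "v1b"], ["v2a", "v2b"]]
motion = [["m1"], ["m2"]]
print(interleave_tokens(visual, motion))
# -> ['v1a', 'v1b', 'm1', 'v2a', 'v2b', 'm2']
```

Interleaving at the frame level (rather than concatenating all visual tokens, then all motion tokens) keeps each frame's appearance and motion information adjacent in the sequence, which is a plausible way to help the model associate spatial artifacts with the temporal context in which they occur.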