Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach

The proliferation of in-the-wild videos has greatly expanded the Video Quality Assessment (VQA) problem. Unlike early definitions that usually focus on a limited set of distortion types, VQA on in-the-wild videos is especially challenging because quality is affected by many interacting factors, including various distortions and diverse contents. Although subjective studies have collected overall quality scores for these videos, how these abstract scores relate to specific factors remains unclear, which prevents VQA methods from producing more concrete quality evaluations (e.g. the sharpness of a video). To address this problem, we collect over two million opinions on 4,543 in-the-wild videos across 13 dimensions of quality-related factors, including in-capture authentic distortions (e.g. motion blur, noise, flicker), errors introduced by compression and transmission, and higher-level experiences of semantic content and aesthetic issues (e.g. composition, camera trajectory), to establish the multi-dimensional Maxwell database. Specifically, we ask the subjects to choose among a positive, a negative, and a neutral label for each dimension. These explanation-level opinions allow us to measure the relationships between specific quality factors and abstract subjective quality ratings, and to benchmark different categories of VQA algorithms on each dimension, so as to analyze their strengths and weaknesses more comprehensively. Furthermore, we propose MaxVQA, a language-prompted VQA approach that modifies the vision-language foundation model CLIP to better capture the important quality issues observed in our analyses. MaxVQA can jointly evaluate various specific quality factors and final quality scores with state-of-the-art accuracy on all dimensions and superb generalization ability on existing datasets. Code and data are available at https://github.com/VQAssessment/MaxVQA.
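
To make the idea of relating explanation-level opinions to overall quality concrete, the following is a minimal sketch and not the authors' analysis code. It assumes the three choices are encoded as +1 (positive), 0 (neutral), and -1 (negative), averages them per video for one dimension, and computes a Spearman rank correlation against overall quality scores. The encoding, the function names, and the toy data are all illustrative assumptions; none of them come from the Maxwell database.

```python
from statistics import mean

# Assumed numeric encoding of the three choices (not specified in the abstract).
CHOICE_VALUE = {"positive": 1, "neutral": 0, "negative": -1}


def dimension_score(opinions):
    """Mean encoded opinion for one video on one dimension, in [-1, 1]."""
    return mean(CHOICE_VALUE[o] for o in opinions)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson linear correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: four videos, three subjects each, one dimension.
sharpness_opinions = [
    ["positive", "positive", "neutral"],
    ["negative", "neutral", "negative"],
    ["neutral", "positive", "neutral"],
    ["negative", "negative", "negative"],
]
overall_quality = [4.2, 2.1, 3.5, 1.3]  # made-up overall quality scores

sharpness_scores = [dimension_score(v) for v in sharpness_opinions]
srcc = spearman(sharpness_scores, overall_quality)
print(sharpness_scores, srcc)
```

Repeating the same computation for each of the 13 dimensions would give one correlation per factor, which is one simple way to see which factors track the overall rating most closely under these assumptions.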
