Towards Explainable In-the-Wild Video Quality Assessment: a Database and a Language-Prompted Approach

The proliferation of in-the-wild videos has greatly expanded the Video Quality Assessment (VQA) problem. Unlike early definitions, which usually focus on a limited set of distortion types, VQA on in-the-wild videos is especially challenging because quality is affected by complicated factors, including various distortions and diverse contents. Although subjective studies have collected overall quality scores for these videos, how the abstract quality scores relate to specific factors remains obscure, which prevents VQA methods from making more concrete quality evaluations (e.g., the sharpness of a video). To address this problem, we collect over two million opinions on 4,543 in-the-wild videos across 13 dimensions of quality-related factors, including in-capture authentic distortions (e.g., motion blur, noise, flicker), errors introduced by compression and transmission, and higher-level experiences of semantic contents and aesthetic issues (e.g., composition, camera trajectory), to establish the multi-dimensional Maxwell database. Specifically, we ask the subjects to choose among a positive, a negative, and a neutral option for each dimension. These explanation-level opinions allow us to measure the relationships between specific quality factors and abstract subjective quality ratings, and to benchmark different categories of VQA algorithms on each dimension, so as to analyze their strengths and weaknesses more comprehensively. Furthermore, we propose MaxVQA, a language-prompted VQA approach that modifies the vision-language foundation model CLIP to better capture the important quality issues observed in our analyses. MaxVQA can jointly evaluate various specific quality factors and final quality scores, with state-of-the-art accuracy on all dimensions and strong generalization ability on existing datasets. Code and data are available at https://github.com/VQAssessment/MaxVQA.
