Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment

Video Quality Assessment (VQA), which aims to predict the perceptual quality of a video, has attracted increasing attention with the rapid growth of streaming media platforms such as Facebook, TikTok, and Kwai. Compared with other sequence-based visual tasks (\textit{e.g.,} action recognition), VQA faces two underestimated and still unresolved challenges in User-Generated Content (UGC) videos. \textit{First}, it is not uncommon for a few frames containing severe distortions (\textit{e.g.,} blocking, blurriness) to determine the perceptual quality of the whole video, whereas other sequence-based tasks rely on more frames of roughly equal importance for their representations. \textit{Second}, the perceptual quality of a video exhibits a multi-distortion distribution, because different distortions vary in duration and probability of occurrence. To address these challenges, we propose the \textit{Visual Quality Transformer (VQT)}, which extracts sparse, quality-related features more efficiently. Methodologically, a Sparse Temporal Attention (STA) module samples keyframes by analyzing the temporal correlation between frames, which reduces the computational complexity from $O(T^2)$ to $O(T \log T)$, where $T$ is the number of frames. Structurally, a Multi-Pathway Temporal Network (MPTN) runs multiple STA modules with different degrees of sparsity in parallel, capturing the distortions that co-exist in a video. Experimentally, VQT outperforms many \textit{state-of-the-art} methods on three public no-reference VQA datasets. Furthermore, VQT outperforms widely adopted industrial algorithms (\textit{i.e.,} VMAF and AVQT) on four full-reference VQA datasets.
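To make the sparse-attention idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how keyframes are scored, so this sketch assumes a ProbSparse-style rule: each frame is scored by the max-minus-mean of its similarity to a random sample of about $\log T$ frames, only the top-scoring frames attend over the full sequence, and the remaining frames fall back to the mean feature. The function names `sparse_temporal_attention` and `multi_pathway`, the sparsity factor `c`, and the use of raw features as queries, keys, and values (no learned projections) are all hypothetical.

```python
import numpy as np

def sparse_temporal_attention(x, c=1.0, rng=None):
    """Illustrative sparse attention over T frame features x of shape (T, d).

    Assumption: only u = ceil(c * ln T) "active" frames compute full
    attention; this is a stand-in for STA, not the paper's exact rule.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    T, d = x.shape
    u = min(T, max(1, int(np.ceil(c * np.log(T)))))

    # Score every frame against u sampled frames: O(T * u) = O(T log T).
    key_idx = rng.choice(T, size=u, replace=False)
    scores = x @ x[key_idx].T / np.sqrt(d)            # (T, u)
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    active = np.argsort(sparsity)[-u:]                # indices of keyframes

    # Inactive frames get the mean feature as a cheap default output.
    out = np.tile(x.mean(axis=0), (T, 1))

    # Active frames attend over all T frames: O(u * T) = O(T log T).
    logits = x[active] @ x.T / np.sqrt(d)             # (u, T)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    out[active] = weights @ x
    return out, active

def multi_pathway(x, factors=(1.0, 2.0, 4.0)):
    """Run several sparsity levels in parallel and concatenate the outputs
    along the feature axis (a hypothetical stand-in for MPTN)."""
    outs = [sparse_temporal_attention(x, c)[0] for c in factors]
    return np.concatenate(outs, axis=1)

if __name__ == "__main__":
    frames = np.random.default_rng(1).standard_normal((64, 8))
    out, active = sparse_temporal_attention(frames)
    print(out.shape, len(active), multi_pathway(frames).shape)
```

Each pathway uses a different `c`, so a small `c` keeps only the few most distinctive frames (short, severe distortions) while a larger `c` retains more frames (longer-lasting distortions). How the real MPTN fuses its pathways is not stated in the abstract; concatenation is used here only for simplicity.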
