Video Quality Assessment (VQA), which aims to predict the perceptual quality of a video, has attracted increasing attention with the rapid development of streaming media platforms such as Facebook, TikTok, and Kwai. Compared with other sequence-based visual tasks (\textit{e.g.,} action recognition), VQA faces two underestimated and unresolved challenges in User Generated Content (UGC) videos. \textit{First}, it is not rare that several frames containing serious distortions (\textit{e.g.,} blocking, blurriness) determine the perceptual quality of the whole video, whereas other sequence-based tasks require more frames of equal importance for their representations. \textit{Second}, the perceptual quality of a video exhibits a multi-distortion distribution, because different distortions differ in duration and probability of occurrence. To address these challenges, we propose the \textit{Visual Quality Transformer (VQT)} to extract quality-related sparse features more efficiently. Methodologically, a Sparse Temporal Attention (STA) is proposed to sample keyframes by analyzing the temporal correlation between frames, which reduces the computational complexity from $O(T^2)$ to $O(T \log T)$. Structurally, a Multi-Pathway Temporal Network (MPTN) applies multiple STA modules with different degrees of sparsity in parallel, capturing co-existing distortions in a video. Experimentally, VQT outperforms many \textit{state-of-the-art} methods on three public no-reference VQA datasets. Furthermore, VQT performs better than widely adopted industrial algorithms (\textit{i.e.,} VMAF and AVQT) on four full-reference VQA datasets.
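To make the $O(T \log T)$ claim concrete, the following is a minimal sketch of one way a sparse temporal attention could be organized. The abstract does not specify how STA scores or selects keyframes, so this sketch substitutes a ProbSparse-style rule in the manner of Informer (max-minus-mean of sampled attention logits) as an assumption. The function name `sparse_temporal_attention`, the constant `c`, and the use of raw frame features as queries, keys, and values are all hypothetical and are not taken from VQT.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_temporal_attention(frames, c=2.0, seed=0):
    """Hypothetical sketch, not the paper's implementation.

    frames: list of T per-frame feature vectors (lists of floats), used
            here as queries, keys and values at once (an assumption).
    Returns (outputs, keyframe_indices).
    """
    rng = random.Random(seed)
    T, d = len(frames), len(frames[0])
    scale = 1.0 / math.sqrt(d)
    # Budget of about c * ln(T), used both for key sampling and keyframe count.
    budget = min(T, max(1, math.ceil(c * math.log(T))))

    # Step 1: score every frame against a random subset of `budget` frames.
    # T * budget dot products, i.e. O(T log T) instead of O(T^2).
    scores = []
    for q in frames:
        sampled = rng.sample(range(T), budget)
        logits = [dot(q, frames[j]) * scale for j in sampled]
        # A large max-minus-mean gap means a peaked, informative attention row.
        scores.append(max(logits) - sum(logits) / len(logits))

    # Step 2: keep the `budget` highest-scoring frames as keyframes.
    keyframes = sorted(range(T), key=lambda t: scores[t], reverse=True)[:budget]

    # Step 3: full softmax attention only for keyframes (budget * T dot
    # products); every other frame falls back to the mean of the values.
    mean_v = [sum(f[i] for f in frames) / T for i in range(d)]
    outputs = [list(mean_v) for _ in range(T)]
    for t in keyframes:
        weights = softmax([dot(frames[t], k) * scale for k in frames])
        outputs[t] = [sum(w * f[i] for w, f in zip(weights, frames))
                      for i in range(d)]
    return outputs, sorted(keyframes)
```

Under these assumptions, both the scoring pass and the attention pass cost on the order of $T \log T$ dot products. Running several such modules in parallel with different values of `c` would give pathways with different degrees of sparsity, which is the general idea the abstract attributes to MPTN.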