The self-attention-based Transformer has achieved great success in many computer vision tasks. However, its application to video quality assessment (VQA) has not been satisfactory so far. Evaluating the quality of in-the-wild videos is challenging because no pristine reference is available and the shooting distortions are unknown. This paper presents a co-trained Space-Time Attention network for the VQA problem, termed StarVQA+. Specifically, we first build StarVQA+ by alternately concatenating divided space-time attention blocks. Then, to facilitate the training of StarVQA+, we design a vectorized regression loss that encodes the mean opinion score (MOS) as a probability vector and embeds a special token as the learnable variable of the MOS, which better fits the human rating process. Finally, to address the data-hungry nature of the Transformer, we propose to co-train the spatial and temporal attention weights using both images and videos. Extensive experiments are conducted on the de facto in-the-wild video datasets, including LIVE-Qualcomm, LIVE-VQC, KoNViD-1k, YouTube-UGC, LSVQ, LSVQ-1080p, and DVL2021. Experimental results demonstrate the superiority of the proposed StarVQA+ over state-of-the-art methods.
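To make the idea of a vectorized regression loss concrete, the following is a minimal sketch and not the authors' implementation. It assumes the MOS is encoded as a softmax over negative squared distances to a set of evenly spaced anchor scores, and that the loss is a cross-entropy between the predicted and target vectors. The abstract does not specify the anchors, the temperature, or the exact loss form, so the names `encode_mos`, `decode_mos`, `vectorized_loss`, and `anchors` are all hypothetical.

```python
import math

def encode_mos(mos, anchors, temperature=1.0):
    """Assumed encoding: softmax over negative squared distances to each anchor."""
    logits = [-((mos - a) ** 2) / temperature for a in anchors]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_mos(probs, anchors):
    """Map a probability vector back to a scalar score (expected anchor value)."""
    return sum(p * a for p, a in zip(probs, anchors))

def vectorized_loss(pred_logits, target_probs):
    """Assumed loss: cross-entropy between softmax(pred_logits) and the target vector."""
    peak = max(pred_logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in pred_logits))
    return -sum(t * (z - log_norm) for t, z in zip(target_probs, pred_logits))

# Hypothetical 5-point rating scale used as anchors.
anchors = [1.0, 2.0, 3.0, 4.0, 5.0]
target = encode_mos(3.0, anchors)
```

In this sketch, a network would output one logit per anchor (for example, from the special MOS token), and `decode_mos` would turn the predicted distribution back into a scalar score at inference time. Note that the expectation-based decoding is not an exact inverse of the encoding for scores near the ends of the scale; that is a property of this simplified illustration.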