Video quality significantly affects video classification. We observed this problem when classifying Mild Cognitive Impairment: the model performed well on clear videos but markedly worse on blurred ones. This observation suggested that incorporating Video Quality Assessment (VQA) could improve video classification. To this end, this paper proposes a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3). SSL-V3 leverages a Combined-SSL mechanism to integrate VQA into video classification and to address the shortage of VQA labels, which is common in video datasets and makes it impossible to provide accurate video quality scores. In brief, Combined-SSL uses the predicted video quality score as a factor that directly modulates the feature map of the video classifier. The score then serves as the point of intersection linking VQA and classification, so the supervised classification task tunes the parameters of the VQA branch. SSL-V3 achieved robust experimental results on two datasets; for example, it reached an accuracy of 94.87% on interview videos from I-CONECT (a healthcare dataset of facial videos), verifying its effectiveness.
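To make the Combined-SSL idea concrete, below is a minimal PyTorch sketch of one plausible reading of the mechanism described above: a no-reference quality branch predicts a scalar score that rescales the classification feature map, and, since no quality labels are available, the VQA branch is trained purely through the supervised classification loss. The class and variable names (`CombinedSSLHead`, `quality_head`, `feat_dim`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CombinedSSLHead(nn.Module):
    """Hypothetical sketch of Combined-SSL: a no-reference VQA branch
    predicts a quality score that modulates the classification features;
    the branch is tuned only by the classification loss."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # No-reference VQA branch: pooled features -> quality score in (0, 1).
        self.quality_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(),
            nn.Linear(feat_dim // 2, 1),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feat_dim) pooled video features from a
        # Video Vision Transformer backbone (not shown here).
        score = self.quality_head(features)   # (batch, 1) quality score
        modulated = features * score          # quality-weighted feature map
        return self.classifier(modulated)     # classification logits

# Usage: gradients from the supervised classification loss flow back
# through `score`, tuning the VQA branch without any quality labels.
head = CombinedSSLHead(feat_dim=768, num_classes=2)
feats = torch.randn(4, 768)                   # stand-in for ViViT features
logits = head(feats)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 0, 1]))
loss.backward()
```

In this reading, the quality score is the "intersected point" the abstract mentions: it is the only path through which classification gradients reach the VQA parameters.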