Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment

The proliferation of videos captured in the wild has driven the development of effective Video Quality Assessment (VQA) methods. Contemporary supervised, opinion-driven VQA approaches rely predominantly on training with expensive human annotations of quality scores, which limits the scale and distribution of VQA datasets and, in turn, leads to unsatisfactory generalization of methods trained on these data. On the other hand, although several handcrafted zero-shot quality indices do not require training on human opinions, they cannot account for the semantics of videos, which renders them ineffective at comprehending complex authentic distortions (e.g., white balance, exposure) and at assessing the quality of semantic content within videos. To address these challenges, we introduce the text-prompted Semantic Affinity Quality Index (SAQI) and its localized version (SAQI-Local), which use Contrastive Language-Image Pre-training (CLIP) to measure the affinity between textual prompts and visual features, enabling a comprehensive examination of semantic quality concerns without relying on human quality annotations. By combining SAQI with existing low-level metrics, we propose the unified Blind Video Quality Index (BVQI) and its improved version, BVQI-Local, which demonstrate unprecedented performance, surpassing existing zero-shot indices by at least 24% on all datasets. Moreover, we devise an efficient fine-tuning scheme for BVQI-Local that jointly optimizes the text prompts and the final fusion weights, resulting in state-of-the-art performance and superior generalization compared with prevalent opinion-driven VQA methods. We conduct comprehensive analyses to investigate the different quality concerns of distinct indices, demonstrating the effectiveness and rationality of our design.
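
To make the idea of a text-prompted affinity index more concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes that each frame and each prompt has already been embedded as a plain vector (standing in for CLIP features), that a score can be formed by passing the gap between the similarity to a positive prompt and the similarity to a negative prompt through a sigmoid, and that fusion with low-level metrics is a simple weighted sum. The function names `affinity_index` and `fuse`, the `temperature` parameter, and the sigmoid mapping are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def affinity_index(frame_features, pos_text, neg_text, temperature=0.1):
    """Hypothetical affinity score in (0, 1), averaged over frames.

    For each frame, the gap between its similarity to the positive prompt
    and its similarity to the negative prompt is scaled by `temperature`
    and squashed with a sigmoid. This mapping is an assumption, not the
    paper's definition of SAQI.
    """
    scores = []
    for feature in frame_features:
        gap = cosine(feature, pos_text) - cosine(feature, neg_text)
        scores.append(1.0 / (1.0 + math.exp(-gap / temperature)))
    return sum(scores) / len(scores)


def fuse(index_scores, weights):
    """Hypothetical fusion: weighted sum of per-index scores."""
    return sum(w * s for w, s in zip(weights, index_scores))


# Toy vectors standing in for embeddings of prompts such as
# "high quality" and "low quality"; real CLIP features would be
# high-dimensional.
pos_prompt = [1.0, 0.0]
neg_prompt = [0.0, 1.0]
frames = [[0.9, 0.1], [0.8, 0.3]]

semantic_score = affinity_index(frames, pos_prompt, neg_prompt)
# 0.4 is a made-up placeholder for a normalized low-level metric.
fused_score = fuse([semantic_score, 0.4], [0.5, 0.5])
```

Under these assumptions, a frame whose feature lies closer to the positive prompt than to the negative one scores above 0.5, and the fusion weights are exactly the kind of parameter that a fine-tuning scheme could adjust together with the prompts.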
