Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional hand-crafted and discriminative learning-based VQA models focus mainly on pixel-level distortions and lack contextual understanding, while recent multimodal large language models (MLLMs) are either insensitive to small distortions or treat quality scoring and quality description as separate tasks. To address these shortcomings, we introduce CP-LLM, a Context- and Pixel-aware Large Language Model. CP-LLM is a novel multimodal LLM architecture with dual vision encoders that independently analyze perceptual quality at high-level (video context) and low-level (pixel distortion) granularities, and a language decoder that then reasons about the interplay between the two. This design enables CP-LLM to produce robust quality scores and interpretable quality descriptions simultaneously, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on VQA benchmarks and superior robustness to pixel distortions.
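The dual-encoder layout described in the abstract can be illustrated with a toy sketch. This is not the CP-LLM implementation: the abstract does not specify the encoders, the token format, or how the decoder consumes them, so every name below (`context_tokens`, `pixel_tokens`, `DecoderInput`, `build_decoder_input`) and the hand-written statistics standing in for learned features are assumptions made purely for illustration. The sketch only shows the data flow: one branch sees a coarse, subsampled view of each frame, the other sees full-resolution neighbor differences, and both token streams are handed to a decoder together with a text prompt.

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[float]]  # one grayscale frame as rows of pixel intensities


def context_tokens(frames: List[Frame], stride: int = 4) -> List[List[float]]:
    """Hypothetical high-level branch: one coarse token per frame.

    Each frame is subsampled by `stride`, so fine detail is discarded and
    only global statistics (mean, intensity range) survive.
    """
    tokens = []
    for frame in frames:
        coarse = [row[::stride] for row in frame[::stride]]
        flat = [p for row in coarse for p in row]
        tokens.append([sum(flat) / len(flat), max(flat) - min(flat)])
    return tokens


def pixel_tokens(frames: List[Frame]) -> List[List[float]]:
    """Hypothetical low-level branch: one token per frame.

    Works at full resolution on horizontal neighbor differences, a crude
    stand-in for sensitivity to blocking or ringing artifacts.
    """
    tokens = []
    for frame in frames:
        diffs = [abs(row[i + 1] - row[i])
                 for row in frame for i in range(len(row) - 1)]
        tokens.append([sum(diffs) / len(diffs), max(diffs)])
    return tokens


@dataclass
class DecoderInput:
    """Assumed container for what a language decoder would receive."""
    context: List[List[float]]
    pixel: List[List[float]]
    prompt: str

    def sequence_length(self) -> int:
        # Number of visual tokens placed ahead of the text prompt.
        return len(self.context) + len(self.pixel)


def build_decoder_input(frames: List[Frame], prompt: str) -> DecoderInput:
    # Both branches run independently on the same frames; combining them
    # is left to the decoder, as in the abstract's description.
    return DecoderInput(context_tokens(frames), pixel_tokens(frames), prompt)


# Usage: a flat frame and a frame with a hard vertical edge.
flat_frame = [[0.5] * 8 for _ in range(8)]
edge_frame = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
example = build_decoder_input([flat_frame, edge_frame],
                              "Rate and describe the video quality.")
```

In this toy setup the pixel branch separates the two frames by their maximum neighbor difference, while the context branch only sees coarse statistics. A real system would replace both functions with learned vision encoders and project their outputs into the decoder's embedding space.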