Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive, because it requires large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to calibrate this prior efficiently to the target MOS space. Based on this insight, we propose DPC-VQA, a framework that decouples perception from calibration for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and a perceptual prior, and employs a lightweight calibration branch to predict a residual correction that adapts the estimate to the target scenario. This design avoids end-to-end retraining and maintains reliable performance at lower training and data cost. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves performance competitive with representative baselines while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods, and that it remains effective with only 20% of the MOS labels. The code will be released upon publication.
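To make the decoupling idea concrete, the following is a minimal sketch, not the DPC-VQA implementation. It assumes the frozen model's outputs are already available as arrays: `base_scores` stands in for the frozen MLLM's base quality estimate and `features` for its perceptual prior. Both are synthetic here, and the calibration branch is replaced by a simple ridge-regularized linear head, a stand-in chosen for brevity. All names (`fit_residual_calibrator`, `predict_quality`) are hypothetical.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so the linear head can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_residual_calibrator(features, base_scores, mos, ridge=1e-2):
    """Fit a linear head that predicts the residual (mos - base_scores).

    The frozen model is never updated; only this small weight vector is learned.
    """
    X = _with_bias(features)
    residual = mos - base_scores
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ residual)


def predict_quality(features, base_scores, weights):
    """Calibrated score = frozen base estimate + predicted residual correction."""
    return base_scores + _with_bias(features) @ weights


# Synthetic stand-in data (assumption: no real MLLM or MOS data is used here).
rng = np.random.default_rng(0)
n_videos, n_dims = 200, 8
features = rng.normal(size=(n_videos, n_dims))
true_w = rng.normal(size=n_dims)
mos = features @ true_w + 3.0 + 0.05 * rng.normal(size=n_videos)
# A deliberately miscalibrated base estimate: right ordering signal, wrong scale and offset.
base_scores = 0.5 * (features @ true_w) + 2.0

# Use only 20% of the MOS labels for calibration, mirroring the label budget in the abstract.
n_train = n_videos // 5
weights = fit_residual_calibrator(
    features[:n_train], base_scores[:n_train], mos[:n_train]
)

held_out = slice(n_train, None)
calibrated = predict_quality(features[held_out], base_scores[held_out], weights)
base_mae = np.mean(np.abs(base_scores[held_out] - mos[held_out]))
calibrated_mae = np.mean(np.abs(calibrated - mos[held_out]))
```

Because the head learns only the residual, a base estimate that is already close to the target MOS scale needs only a small correction, and the number of trainable parameters stays tiny relative to the frozen model. The synthetic setup favors a linear head by construction, so it illustrates the mechanism only and says nothing about performance on real benchmarks.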