Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive because it requires large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to calibrate this prior efficiently to the target MOS space. Based on this insight, we propose DPC-VQA, a framework for video quality assessment that decouples perception from calibration. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction that adapts the estimate to the target scenario. This design avoids costly end-to-end retraining while maintaining reliable performance at lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
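The decoupling described above (a frozen model supplies a base score, and a small trainable branch predicts a residual toward the target MOS space) can be illustrated with a toy sketch. This is not the authors' implementation: `frozen_base_score`, `ResidualCalibrator`, the linear form of the calibration head, the two-dimensional features, and the synthetic MOS values are all assumptions made for illustration, with the frozen MLLM replaced by a fixed function.

```python
import random


def frozen_base_score(features):
    """Hypothetical stand-in for the frozen MLLM: a fixed, never-updated estimate."""
    return 3.0 + 0.5 * features[0]


class ResidualCalibrator:
    """Assumed lightweight linear head that predicts residual = MOS - base score."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict(self, features):
        return sum(w * x for w, x in zip(self.w, features)) + self.b

    def fit(self, feats, residuals, lr=0.1, epochs=500):
        # Full-batch gradient descent on the mean squared residual error.
        n = len(feats)
        for _ in range(epochs):
            grad_w = [0.0] * len(self.w)
            grad_b = 0.0
            for x, r in zip(feats, residuals):
                err = self.predict(x) - r
                for j, xj in enumerate(x):
                    grad_w[j] += 2.0 * err * xj / n
                grad_b += 2.0 * err / n
            self.w = [w - lr * g for w, g in zip(self.w, grad_w)]
            self.b -= lr * grad_b


def calibrated_score(features, calibrator):
    # Final prediction = frozen base estimate + learned residual correction.
    return frozen_base_score(features) + calibrator.predict(features)


def mean_abs_error(preds, targets):
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


# Synthetic data: made-up features and a made-up target MOS space.
rng = random.Random(0)
features = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
target_mos = [2.5 + 0.8 * x[0] - 0.4 * x[1] for x in features]

# Only 20% of the samples (40 of 200) are treated as MOS-labelled.
labelled = range(40)
calibrator = ResidualCalibrator(dim=2)
calibrator.fit(
    [features[i] for i in labelled],
    [target_mos[i] - frozen_base_score(features[i]) for i in labelled],
)

base_mae = mean_abs_error([frozen_base_score(x) for x in features], target_mos)
calib_mae = mean_abs_error(
    [calibrated_score(x, calibrator) for x in features], target_mos
)
print(f"base MAE: {base_mae:.3f}  calibrated MAE: {calib_mae:.3f}")
```

Only the calibrator's three parameters are trained here, and the base scorer is never touched. That is the property the sketch is meant to show, under the simplifying assumption that the gap between the base estimate and the target MOS is linear in the features.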