Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: \textit{poor generalization to out-of-distribution (OOD) videos} and \textit{limited explainability}, which restrict their applicability in real-world scenarios. To address these challenges, we propose \textbf{VQAThinker}, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a \textbf{bell-shaped regression reward} that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a \textbf{pairwise ranking reward} that guides the model to correctly determine the relative quality between video pairs; and (3) a \textbf{temporal consistency reward} that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.
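The three VQA-specific rewards can be illustrated with a minimal sketch. This is not the paper's implementation: the Gaussian form of the bell-shaped reward (and its width `sigma`), and the 0/1 indicator forms of the ranking and temporal-consistency rewards, are illustrative assumptions consistent with the descriptions above, not the authors' exact definitions.

```python
import math

def bell_regression_reward(pred: float, gt: float, sigma: float = 1.0) -> float:
    """Bell-shaped regression reward (assumed Gaussian form).

    Rises quickly as the prediction error |pred - gt| shrinks, and the
    curve flattens near zero error, so the reward becomes progressively
    less sensitive close to the ground truth.
    """
    return math.exp(-((pred - gt) ** 2) / (2 * sigma ** 2))

def pairwise_ranking_reward(pred_a: float, pred_b: float,
                            gt_a: float, gt_b: float) -> float:
    """Pairwise ranking reward (assumed 0/1 indicator).

    Returns 1.0 when the predicted ordering of the two videos matches
    the ground-truth ordering, and 0.0 otherwise.
    """
    return 1.0 if (pred_a - pred_b) * (gt_a - gt_b) > 0 else 0.0

def temporal_consistency_reward(score_original: float,
                                score_perturbed: float) -> float:
    """Temporal consistency reward (assumed 0/1 indicator).

    Rewards the model for scoring the temporally coherent video above
    its temporally perturbed counterpart.
    """
    return 1.0 if score_original > score_perturbed else 0.0
```

Under this sketch, all three terms map into [0, 1], so they can be summed or weighted into a single GRPO reward alongside any format reward.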