Video Quality Assessment (VQA) is evolving beyond a single-number mean opinion score toward richer, multi-faceted evaluations of video content. In this paper, we present UltraVQA, a large-scale multi-dimensional VQA dataset of diverse User-Generated Content (UGC) annotated across five key quality dimensions: Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality. Each video is scored on these dimensions by more than three human raters, labeled with fine-grained sub-attributes, and accompanied by an explanatory rationale that GPT generates from the collective human judgments. To better leverage these rich annotations and improve discrete quality score assessment, we introduce Analytic Score Optimization (ASO), a theoretically grounded post-training objective derived for multi-dimensional VQA. By reframing quality assessment as a regularized decision-making process, we obtain a closed-form solution that naturally captures the ordinal nature of human ratings and keeps predictions aligned with human ranking preferences. In experiments, our method outperforms most baselines, including closed-source APIs and open-source models, while also reducing mean absolute error (MAE) in quality prediction. Our work highlights the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment in advancing video quality assessment.