The explosion of visual content available online underscores the need for an accurate machine assessor that can robustly score diverse types of visual content. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) in a wide range of related fields, in this work we explore how to teach them to rate visual content in alignment with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks under the original LMM structure. With this syllabus, we further unify the three tasks into one model, termed OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and pre-trained weights are released at https://github.com/Q-Future/Q-Align.
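As a rough illustration of what teaching with text-defined levels instead of scores could involve, the sketch below converts a numeric opinion score into a level label for training, and converts per-level logits back into a scalar score at inference. The five level names, the equal-width binning, the 1-to-5 level weights, and the helper names `score_to_level` and `levels_to_score` are assumptions made for this example; the abstract does not specify them, and this is not the released Q-Align code.

```python
import math

# Assumed level vocabulary, ordered from worst to best (illustrative only).
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def score_to_level(score, lo, hi):
    """Map a numeric opinion score in [lo, hi] to a text level.

    Assumes equal-width bins over the score range, one bin per level.
    """
    if not lo <= score <= hi:
        raise ValueError("score must lie within [lo, hi]")
    width = (hi - lo) / len(LEVELS)
    # Clamp so that score == hi falls into the top bin.
    index = min(int((score - lo) / width), len(LEVELS) - 1)
    return LEVELS[index]


def levels_to_score(logits):
    """Turn per-level logits into a scalar score in [1, 5].

    Applies a softmax over the level logits, then takes the expected
    value with weights 1..5 (an assumed convention for this sketch).
    """
    if len(logits) != len(LEVELS):
        raise ValueError("expected one logit per level")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return sum((i + 1) * e / total for i, e in enumerate(exps))


if __name__ == "__main__":
    print(score_to_level(55, 0, 100))
    print(levels_to_score([0.1, 0.2, 1.5, 2.0, 0.3]))
```

Under these assumptions, training targets become short text labels that an LMM can emit as ordinary tokens, while the probability-weighted average recovers a continuous score that can be compared against human opinion scores.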