Image aesthetics assessment (IAA) is attracting wide interest with the prevalence of social media. The problem is challenging because aesthetic judgments are subjective and ambiguous. Aesthetic features need not be extracted solely from the image: user comments associated with an image could provide complementary knowledge that is useful for IAA. Existing large-scale pre-trained models have demonstrated strong capabilities in extracting high-quality, transferable visual and textual features, and learnable queries have been shown to be effective in extracting useful features from pre-trained visual features. In this paper, we therefore propose MMLQ, which uses multi-modal learnable queries to extract aesthetics-related features from multi-modal pre-trained features. Extensive experimental results demonstrate that MMLQ achieves new state-of-the-art performance on multi-modal IAA, beating previous methods by 7.7% and 8.3% in terms of SRCC and PLCC, respectively.
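For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how SRCC (Spearman's rank correlation coefficient) and PLCC (Pearson's linear correlation coefficient) are commonly computed between predicted and ground-truth aesthetic scores. The function names and the example scores are illustrative assumptions and are not taken from the paper; the paper's own evaluation code may differ.

```python
import math

def pearson(x, y):
    """PLCC: linear correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks of the scores."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predicted and ground-truth mean aesthetic scores.
predicted = [5.1, 6.3, 4.2, 7.0]
ground_truth = [5.0, 6.0, 4.5, 7.5]
print("SRCC:", spearman(predicted, ground_truth))
print("PLCC:", pearson(predicted, ground_truth))
```

SRCC depends only on the ordering of the scores, so it measures how well a model ranks images, while PLCC is also sensitive to how linearly the predicted values track the ground truth.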