Image Aesthetic Assessment (IAA) is an important and intricate task that involves analyzing and assessing an image's aesthetic value and identifying its highlights and areas for improvement. Traditional IAA methods often concentrate on a single aesthetic task and suffer from a shortage of labeled data, which limits in-depth aesthetic comprehension. Despite efforts to overcome this challenge by applying Multi-modal Large Language Models (MLLMs), such models remain underdeveloped for IAA. To address this, we propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight. Central to our approach is an innovative multi-scale text-guided self-supervised learning technique. This technique features a multi-scale feature alignment module and exploits abundant unlabeled data in a self-supervised manner to enhance aesthetic ability both structurally and functionally. Empirical results indicate that, combined with extensive instruction tuning, our model achieves new state-of-the-art results across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. Notably, it also demonstrates zero-shot capability on the emerging task of aesthetic suggesting. Furthermore, for personalized image aesthetic assessment, we harness in-context learning and demonstrate its advantages.