Through collective effort, multimodal large language models (MLLMs) are developing rapidly. However, their performance on image aesthetics perception, a capability highly desired in real-world applications, remains unclear. An obvious obstacle is the absence of a dedicated benchmark for evaluating MLLMs on aesthetic perception. Without such a benchmark, progress is made blindly, which may impede the development of more advanced MLLMs with aesthetic perception capacity. To address this problem, we propose AesBench, an expert benchmark that aims to comprehensively evaluate the aesthetic perception capacities of MLLMs through an elaborate design with two facets. (1) We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diverse image content and high-quality annotations provided by professional aesthetic experts. (2) We propose a set of integrative criteria that measure the aesthetic perception abilities of MLLMs from four perspectives: Perception (AesP), Empathy (AesE), Assessment (AesA), and Interpretation (AesI). Extensive experimental results show that current MLLMs possess only rudimentary aesthetic perception ability, and that a significant gap remains between MLLMs and humans. We hope this work inspires the community to explore the aesthetic potential of MLLMs more deeply. Source data will be available at https://github.com/yipoh/AesBench.