Vision-language models (VLMs) have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. Their capabilities in the domain of art are also growing, from analyzing artworks to generating art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which these mechanisms align with the criteria art historians use to reason about style. We employ a latent-space decomposition approach to identify the concepts that drive art style prediction, and we examine them through quantitative evaluation, causal analysis, and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature, and 90% of the concepts used to predict the style of a given artwork are judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.