This paper develops small vision language models for understanding visual art: given an artwork, the model identifies its emotion category and explains this prediction in natural language. While small models are computationally efficient, their capacity is limited compared with that of large models. To break this trade-off, this paper builds a small emotional vision language model (SEVLM) through emotion modeling and input-output feature alignment. On the one hand, based on valence-arousal-dominance (VAD) knowledge annotated by psychology experts, we introduce and fuse emotional features derived from a VAD dictionary, and add a VAD head that aligns the VAD vectors of the predicted emotion explanation and the ground truth. This allows the vision language model to better understand and generate emotional text, compared with using traditional text embeddings alone. On the other hand, we design a contrastive head that pulls together the embeddings of the image, its emotion class, and its explanation, aligning model outputs with inputs. On two public affective explanation datasets, we show that the proposed techniques consistently improve the visual art understanding performance of the baseline SEVLM. Importantly, the proposed model can be trained and evaluated on a single RTX 2080 Ti while exhibiting very strong performance: it not only outperforms state-of-the-art small models but is also competitive with fine-tuned LLaVA 7B and GPT4(V). The code is available at https://github.com/BetterZH/SEVLM-code.
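The VAD alignment described above can be illustrated with a minimal sketch: words of the predicted and ground-truth explanations are mapped to valence-arousal-dominance scores via a lexicon, averaged into sentence-level VAD vectors, and penalized by their distance. The toy lexicon, fallback value, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
# Toy stand-in for an expert-annotated VAD dictionary (values in [0, 1]).
VAD_LEXICON = {
    "joyful":  (0.95, 0.70, 0.60),
    "serene":  (0.80, 0.20, 0.55),
    "gloomy":  (0.15, 0.30, 0.30),
    "furious": (0.10, 0.95, 0.65),
}
NEUTRAL = (0.5, 0.5, 0.5)  # assumed fallback for out-of-lexicon words

def sentence_vad(text):
    """Average the per-word VAD scores into a sentence-level VAD vector."""
    vecs = [VAD_LEXICON.get(w.lower(), NEUTRAL) for w in text.split()]
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(3))

def vad_alignment_loss(pred_text, gold_text):
    """Mean squared error between sentence-level VAD vectors."""
    p, g = sentence_vad(pred_text), sentence_vad(gold_text)
    return sum((pi - gi) ** 2 for pi, gi in zip(p, g)) / 3
```

In training, a loss of this form would be added to the usual language-modeling objective so that generated explanations match the emotional profile of the reference, not just its surface wording.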
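The contrastive head can likewise be sketched in miniature: the embeddings of the image, its emotion class, and its explanation are encouraged to be mutually close in cosine similarity. A real implementation would use learned encoders and in-batch negatives; this hedged, illustrative version only scores a single triple under that assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def triple_alignment_loss(img_emb, label_emb, expl_emb):
    """1 minus the mean pairwise cosine similarity of the three embeddings;
    lower values mean the image, emotion class, and explanation are better
    aligned in the shared embedding space."""
    sims = [cosine(img_emb, label_emb),
            cosine(img_emb, expl_emb),
            cosine(label_emb, expl_emb)]
    return 1.0 - sum(sims) / len(sims)
```

Minimizing a loss of this shape pulls the three embeddings toward one another, which is the input-output alignment effect the abstract attributes to the contrastive head.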