This paper develops small vision language models for understanding visual art: given an artwork, the model identifies its emotion category and explains this prediction in natural language. While small models are computationally efficient, their capacity is limited compared with that of large models. To break this trade-off, this paper builds a small emotional vision language model (SEVLM) through emotion modeling and input-output feature alignment. On the one hand, based on valence-arousal-dominance (VAD) knowledge annotated by psychology experts, we introduce and fuse emotional features derived from a VAD dictionary, and add a VAD head that aligns the VAD vectors of the predicted emotion explanation and the ground truth. This allows the vision language model to better understand and generate emotional text, compared with using traditional text embeddings alone. On the other hand, we design a contrastive head that pulls together the embeddings of the image, its emotion class, and its explanation, aligning model outputs with inputs. On two public affective explanation datasets, we show that the proposed techniques consistently improve the visual art understanding performance of the baseline SEVLM. Importantly, the proposed model can be trained and evaluated on a single RTX 2080 Ti while exhibiting very strong performance: it not only outperforms state-of-the-art small models but is also competitive with fine-tuned LLaVA 7B and GPT4(V). The code is available at https://github.com/BetterZH/SEVLM-code.
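The VAD alignment described above can be illustrated with a minimal sketch: words of the predicted and ground-truth explanations are mapped to valence-arousal-dominance scores via a lexicon, averaged into sentence-level VAD vectors, and penalized by their distance. The toy lexicon, fallback value, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
# Toy stand-in for an expert-annotated VAD dictionary (values in [0, 1]).
VAD_LEXICON = {
    "joyful":  (0.95, 0.70, 0.60),
    "serene":  (0.80, 0.20, 0.55),
    "gloomy":  (0.15, 0.30, 0.30),
    "furious": (0.10, 0.95, 0.65),
}
NEUTRAL = (0.5, 0.5, 0.5)  # assumed fallback for out-of-lexicon words

def sentence_vad(text):
    """Average the per-word VAD scores into a sentence-level VAD vector."""
    vecs = [VAD_LEXICON.get(w.lower(), NEUTRAL) for w in text.split()]
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(3))

def vad_alignment_loss(pred_text, gold_text):
    """Mean squared error between sentence-level VAD vectors."""
    p, g = sentence_vad(pred_text), sentence_vad(gold_text)
    return sum((pi - gi) ** 2 for pi, gi in zip(p, g)) / 3
```

In training, a loss of this form would be added to the usual language-modeling objective so that generated explanations match the emotional profile of the reference, not just its surface wording.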
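The contrastive head can likewise be sketched in miniature: the embeddings of the image, its emotion class, and its explanation are encouraged to be mutually close in cosine similarity. A real implementation would use learned encoders and in-batch negatives; this hedged, illustrative version only scores a single triple under that assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def triple_alignment_loss(img_emb, label_emb, expl_emb):
    """1 minus the mean pairwise cosine similarity of the three embeddings;
    lower values mean the image, emotion class, and explanation are better
    aligned in the shared embedding space."""
    sims = [cosine(img_emb, label_emb),
            cosine(img_emb, expl_emb),
            cosine(label_emb, expl_emb)]
    return 1.0 - sum(sims) / len(sims)
```

Minimizing a loss of this shape pulls the three embeddings toward one another, which is the input-output alignment effect the abstract attributes to the contrastive head.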