Visual Instruction Tuning is a novel learning paradigm that fine-tunes pre-trained language models with task-specific instructions. It has shown promising zero-shot results across a variety of natural language processing tasks but remains unexplored in visual emotion understanding. In this work, we focus on enhancing the model's proficiency in understanding and adhering to instructions related to emotional contexts. We first identify key visual clues critical to visual emotion recognition. We then introduce a novel GPT-assisted pipeline for generating emotion visual instruction data, effectively addressing the scarcity of annotated instruction data in this domain. Building on the groundwork established by InstructBLIP, our proposed EmoVIT architecture incorporates emotion-specific instruction data, leveraging the powerful capabilities of Large Language Models to enhance performance. Through extensive experiments, our model demonstrates proficiency in emotion classification, adeptness in affective reasoning, and competence in comprehending humor. The comparative analysis establishes a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs, offering valuable insights and opening avenues for future exploration in this domain. Our code is available at \url{https://github.com/aimmemotion/EmoVIT}.
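To make the GPT-assisted data generation concrete, the following is a minimal sketch of how such a pipeline could look. It is an illustration, not the paper's implementation: the use of the OpenAI chat API, the model name, the prompt wording, and the clue fields (caption, objects, facial expression) are all assumptions introduced here for clarity.

\begin{verbatim}
# Minimal sketch of a GPT-assisted emotion instruction-data pipeline.
# Assumptions (not from the paper): the OpenAI chat API is the GPT backend,
# visual clues are pre-extracted per image, and the prompt/output format
# below is purely illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_emotion_instruction(visual_clues: dict,
                                 emotion_label: str) -> dict:
    """Turn visual clues plus a ground-truth emotion label into an
    instruction/response pair for visual instruction tuning."""
    prompt = (
        "You are given visual clues extracted from an image and its "
        f"ground-truth emotion label '{emotion_label}'.\n"
        f"Visual clues: {json.dumps(visual_clues)}\n"
        "Write (1) a question asking what emotion the image evokes and why, "
        "and (2) an answer that reasons over the clues and ends with the "
        "label. Return JSON with keys 'instruction' and 'response'."
    )
    reply = client.chat.completions.create(
        model="gpt-4",  # hypothetical choice; the paper only says GPT-assisted
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns well-formed JSON; a real pipeline would
    # validate and retry.
    return json.loads(reply.choices[0].message.content)

# Example usage with made-up clues for a single image
clues = {"caption": "a child hugging a dog in the rain",
         "objects": ["child", "dog", "umbrella"],
         "facial_expression": "smiling"}
print(generate_emotion_instruction(clues, "contentment"))
\end{verbatim}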