Recognising emotions in context involves identifying the apparent emotions of an individual, taking into account contextual cues from the surrounding scene. Previous approaches to this task have involved the design of explicit scene-encoding architectures or the incorporation of external scene-related information, such as captions. However, these methods often utilise limited contextual information or rely on intricate training pipelines. In this work, we leverage the groundbreaking capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification in a two-stage approach, without introducing additional complexity to the training process. In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion relative to the visual context. In the second stage, these descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture that fuses text and visual features before the final classification task. Our experimental results show that the text and image features carry complementary information, and that our fused architecture significantly outperforms the individual modalities without any complex training methods. We evaluate our approach on three different datasets, namely EMOTIC, CAER-S, and BoLD, and achieve state-of-the-art or comparable accuracy across all datasets and metrics compared to much more complex approaches. The code will be made publicly available on GitHub: https://github.com/NickyFot/EmoCommonSense.git
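The fusion stage described above might look roughly like the following PyTorch sketch. The encoder choices, feature dimensions, transformer depth, and number of emotion classes (26, as in EMOTIC's categorical labels) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Minimal sketch of the second stage: fuse text and visual features
    with a small transformer encoder before emotion classification.
    All dimensions and hyperparameters here are assumptions for illustration."""

    def __init__(self, text_dim=768, img_dim=768, d_model=512, num_classes=26):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)
        # A lightweight transformer fuses the concatenated token sequences.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.cls_head = nn.Linear(d_model, num_classes)

    def forward(self, text_feats, img_feats):
        # text_feats: (B, T_t, text_dim) from a text encoder over the
        # VLLM-generated description; img_feats: (B, T_v, img_dim) from a
        # visual backbone over the input image.
        tokens = torch.cat([self.text_proj(text_feats), self.img_proj(img_feats)], dim=1)
        fused = self.fusion(tokens)
        # Mean-pool the fused sequence and classify into emotion categories.
        return self.cls_head(fused.mean(dim=1))

# Example usage with dummy features (hypothetical shapes).
model = FusionClassifier()
text_feats = torch.randn(2, 32, 768)   # description tokens
img_feats = torch.randn(2, 49, 768)    # image patch tokens
logits = model(text_feats, img_feats)  # (2, 26)
```

The sketch simply concatenates the projected text and image tokens and lets self-attention mix the modalities; this is one standard way to realise the "fuse text and visual features before classification" step summarised in the abstract.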