Emotion recognition aims to discern the emotional state of subjects within an image, drawing on both subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: they first localize subjects with off-the-shelf detectors, then classify emotions through late fusion of subject and context features. However, this pipeline suffers from disjoint training stages and limited interaction between fine-grained subject and context elements. To address these challenges, we present a single-stage emotion recognition approach that employs a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification. Rather than separating the training stages, we jointly use box and emotion signals as supervision to enrich subject-centric feature learning. Furthermore, DSCT facilitates interactions between fine-grained subject and context cues in a decouple-then-fuse manner. The decoupled query tokens (subject queries and context queries) gradually intertwine across the layers of DSCT, during which spatial and semantic relations are exploited and aggregated. We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC. Our approach surpasses two-stage alternatives with fewer parameters, improving accuracy by 3.39% on CAER-S and average precision by 6.46% on EMOTIC.
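To make the idea of joint supervision concrete, the following is a minimal sketch of a combined objective for one subject, not the loss used in this work. It assumes the prediction has already been matched to its ground-truth subject, an L1 box term, a single-label softmax cross-entropy emotion term, and equal weights; the names `l1_box_loss`, `cross_entropy`, `joint_loss`, `w_box`, and `w_emo` are hypothetical.

```python
import math


def l1_box_loss(pred_box, gt_box):
    """Mean absolute error between two boxes given as (cx, cy, w, h)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / len(gt_box)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one subject with a single emotion label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(pred_box, gt_box, logits, label, w_box=1.0, w_emo=1.0):
    """Weighted sum of box and emotion terms (weights are assumed values)."""
    box_term = l1_box_loss(pred_box, gt_box)
    emo_term = cross_entropy(logits, label)
    return w_box * box_term + w_emo * emo_term


# One subject: a perfectly localized box and uniform logits over two classes,
# so only the emotion term contributes.
loss = joint_loss((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2), [0.0, 0.0], 0)
```

Because both terms are computed from the same subject prediction, box errors and emotion errors contribute to a single training signal. A multi-label dataset would need a different emotion term, such as a per-class binary loss.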