Emotion recognition aims to discern the emotional state of subjects in an image from subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: they first localize subjects with off-the-shelf detectors, then classify emotions through late fusion of subject and context features. However, this complicated paradigm suffers from disjoint training stages and limited interaction between fine-grained subject and context elements. To address these challenges, we present a single-stage emotion recognition approach that employs a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification. Rather than compartmentalizing training stages, we jointly use box and emotion signals as supervision to enrich subject-centric feature learning. Furthermore, DSCT facilitates interactions between fine-grained subject-context cues in a decouple-then-fuse manner: the decoupled query tokens (subject queries and context queries) gradually intertwine across DSCT layers, during which spatial and semantic relations are exploited and aggregated. We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC. Our approach surpasses two-stage alternatives with fewer parameters, achieving a 3.39% accuracy improvement on CAER-S and a 6.46% average precision gain on EMOTIC.
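The decouple-then-fuse idea can be illustrated with a minimal sketch. This is not the DSCT implementation: the function names (`attend`, `decouple_then_fuse_layer`, `run_sketch`), the layer layout, and the random inputs are all assumptions made for illustration, and learned projections, normalization, and the box and emotion prediction heads are omitted. It only shows two separate query sets that first read image features independently and then interact across layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections (assumption)."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


def add(xs, ys):
    """Element-wise residual sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def decouple_then_fuse_layer(subject_q, context_q, image_feats):
    """One hypothetical layer; the real DSCT layer may be organized differently."""
    # Decouple: each query set reads the image features on its own.
    subject_q = add(subject_q, attend(subject_q, image_feats, image_feats))
    context_q = add(context_q, attend(context_q, image_feats, image_feats))
    # Fuse: subject queries gather information from the context queries.
    subject_q = add(subject_q, attend(subject_q, context_q, context_q))
    return subject_q, context_q


def run_sketch(num_layers=3, num_queries=4, num_tokens=6, dim=8, seed=0):
    """Stack a few layers over random stand-in features (illustrative sizes)."""
    rng = random.Random(seed)

    def make(n):
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]

    image_feats, subject_q, context_q = make(num_tokens), make(num_queries), make(num_queries)
    for _ in range(num_layers):
        subject_q, context_q = decouple_then_fuse_layer(subject_q, context_q, image_feats)
    return subject_q, context_q
```

In a single-stage setup of the kind described above, the final subject queries would presumably feed both a box head and an emotion head, so that both supervision signals shape the same features; those heads are left out here because the abstract does not specify them.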