Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
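The selective fusion step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the encoder features have already been projected into a shared latent space and that the gating module has already produced one unnormalized importance score per encoder for the sample. The names `top_n_fuse`, `projected`, and `gate_scores` are hypothetical, as is the choice to renormalize with a softmax over only the kept encoders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_fuse(projected, gate_scores, n):
    """Fuse the n highest-scoring encoders for a single sample.

    projected:   list of K vectors (lists of floats) in a shared latent space
    gate_scores: list of K unnormalized importance scores (assumed to come
                 from a gating module, which is not modeled here)
    n:           number of encoders to keep
    Returns the fused vector and the sorted indices of the kept encoders.
    """
    if len(projected) != len(gate_scores):
        raise ValueError("need one gate score per encoder")
    if not 1 <= n <= len(projected):
        raise ValueError("n must be between 1 and the number of encoders")

    # Rank encoders by importance and keep the top n indices.
    ranked = sorted(range(len(gate_scores)),
                    key=lambda k: gate_scores[k], reverse=True)
    keep = ranked[:n]

    # Assumption: weights are renormalized over the kept encoders only.
    weights = softmax([gate_scores[k] for k in keep])

    # Weighted sum of the kept encoder representations.
    dim = len(projected[0])
    fused = [0.0] * dim
    for w, k in zip(weights, keep):
        for d in range(dim):
            fused[d] += w * projected[k][d]
    return fused, sorted(keep)


# Three toy encoders; the third gets a very low score and is dropped.
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
scores = [2.0, 2.0, -10.0]
fused, kept = top_n_fuse(features, scores, n=2)
print(kept, fused)
```

In this toy case the two kept encoders have equal scores, so each contributes with weight 0.5 and the discarded encoder has no influence on the fused vector. In a trained system the scores would vary per sample, so different inputs could select different encoder subsets.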