Decades of research indicate that emotion recognition is more effective when it draws on multiple modalities. But what if some modalities are sometimes missing? To address this problem, we propose a novel Transformer-based architecture for recognizing valence and arousal in a time-continuous manner, even when input modalities are missing. We couple cross-attention and self-attention mechanisms to emphasize relationships between modalities over time and to enhance learning on weakly salient inputs. Experimental results on the Ulm-TSST dataset show that our model improves the concordance correlation coefficient (CCC) by 37% when predicting arousal and by 30% when predicting valence, compared to a late-fusion baseline.
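For readers unfamiliar with the evaluation metric, the concordance correlation coefficient can be sketched as follows. This is a minimal illustration of the standard definition (Lin's CCC) using population statistics, not the evaluation code behind the reported results; the function name `concordance_cc` and the toy arrays are assumptions made for this example.

```python
import numpy as np


def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences.

    Hypothetical helper for illustration: computes
    2 * cov / (var_pred + var_gold + (mean_pred - mean_gold)^2)
    with population (ddof=0) statistics.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    # Population covariance, consistent with the ddof=0 variances above.
    cov = ((pred - mean_p) * (gold - mean_g)).mean()
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


# Toy time-continuous annotations (made-up values, not from Ulm-TSST).
gold = [1.0, 2.0, 3.0]
print(concordance_cc(gold, gold))             # perfect agreement
print(concordance_cc([2.0, 3.0, 4.0], gold))  # same shape, constant offset
```

Unlike Pearson correlation, the CCC penalizes a constant offset or a scale mismatch between prediction and annotation, which is why the second call scores below 1 even though the two sequences are perfectly correlated.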