Humans express emotion in many ways, including facial actions, voice, and natural language. Because individuals differ widely and emotional expression is complex, the emotions conveyed by different modalities may be semantically irrelevant to one another. Directly fusing information from all modalities can therefore expose the model to noise from semantically irrelevant modalities. To tackle this problem, we propose a multimodal relevance estimation network that captures the relevant semantics among modalities in multimodal emotion. Specifically, we use an attention mechanism to reflect the semantic relevance weight of each modality. Moreover, we propose a relevant semantic estimation loss to weakly supervise the semantics of each modality. Furthermore, we use contrastive learning to optimize the similarity of category-level modality-relevant semantics across modalities in feature space, thereby bridging the semantic gap between heterogeneous modalities. To better reflect emotional states in real interactive scenarios and to support semantic relevance analysis, we collect a single-label discrete multimodal emotion dataset named SDME, which enables researchers to conduct multimodal semantic relevance research with large category bias. Experiments on continuous and discrete emotion datasets show that our model effectively captures the relevant semantics, especially when the semantics of the modalities deviate widely from one another. The code and the SDME dataset will be made publicly available.
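The idea of attention-based relevance weighting described in the abstract can be illustrated with a minimal sketch. This is not the authors' network: the abstract does not specify the architecture, so the scaled dot-product scoring, the fixed `query` vector (a stand-in for learned parameters), the function name `relevance_weighted_fusion`, and the toy feature values are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relevance_weighted_fusion(features, query):
    """Fuse per-modality feature vectors with attention weights.

    features: dict mapping a modality name to its feature vector
              (all vectors share the same length as `query`).
    query:    hypothetical query vector; in a trained model this
              would be learned, here it is fixed for illustration.
    Returns (weights per modality, fused feature vector).
    """
    names = list(features)
    dim = len(query)
    # One scaled dot-product score per modality.
    scores = [dot(features[n], query) / math.sqrt(dim) for n in names]
    weights = softmax(scores)
    # Weighted sum of the modality features, dimension by dimension.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy example: the "text" features disagree with the other two modalities.
features = {
    "face": [1.0, 0.0, 1.0, 0.0],
    "voice": [0.9, 0.1, 0.8, 0.0],
    "text": [-1.0, 0.5, -1.0, 0.2],
}
query = [1.0, 0.0, 1.0, 0.0]

weights, fused = relevance_weighted_fusion(features, query)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

In this toy setup the modality whose features disagree with the query receives the smallest weight, so it contributes least to the fused representation. That is the intuition behind down-weighting semantically irrelevant modalities rather than fusing all of them equally.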