Fine-grained emotion recognition (FER) plays a vital role in fields such as disease diagnosis, personalized recommendation, and multimedia mining. However, existing FER methods face three key challenges in real-world applications: (i) because emotions are complex and ambiguous in reality, they rely on large amounts of continuously annotated data to ensure accuracy, which is costly and time-consuming; (ii) they cannot capture the temporal heterogeneity caused by changing emotion patterns, because they usually assume that temporal correlation is uniform within sampling periods; (iii) they do not consider the spatial heterogeneity of different FER scenarios, i.e., the distribution of emotion information across different data may exhibit bias or interference. To address these challenges, we propose a Spatio-Temporal Fuzzy-oriented Multi-modal Meta-learning framework (ST-F2M). Specifically, ST-F2M first divides multi-modal videos into multiple views, each corresponding to one modality of one emotion; multiple randomly selected views of the same emotion form a meta-training task. Next, ST-F2M encodes the data of each task with an integrated module of spatial and temporal convolutions, capturing both spatial and temporal heterogeneity. It then attaches fuzzy semantic information to each task based on generalized fuzzy rules, which helps handle the complexity and ambiguity of emotions. Finally, ST-F2M learns emotion-related general meta-knowledge through meta-recurrent neural networks to achieve fast and robust fine-grained emotion recognition. Extensive experiments show that ST-F2M outperforms various state-of-the-art methods in terms of accuracy and model efficiency. In addition, we conduct ablation studies and further analyses to explore why ST-F2M performs well.