Not all Blends are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations

Humans often experience not just a single basic emotion at a time, but rather a blend of several emotions with varying salience. Despite the importance of such blended emotions, most video-based emotion recognition approaches are designed to recognize single emotions only. The few approaches that have attempted to recognize blended emotions typically cannot assess the relative salience of the emotions within a blend. This limitation largely stems from the lack of datasets containing a substantial number of blended emotion samples annotated with relative salience. To address this shortcoming, we introduce BLEMORE, a novel dataset for multimodal (video, audio) blended emotion recognition that includes information on the relative salience of each emotion within a blend. BLEMORE comprises over 3,000 clips from 58 actors, performing 6 basic emotions and 10 distinct blends, where each blend has 3 different salience configurations (50/50, 70/30, and 30/70). Using this dataset, we conduct extensive evaluations of state-of-the-art video classification approaches on two blended emotion prediction tasks: (1) predicting the presence of emotions in a given sample, and (2) predicting the relative salience of emotions in a blend. Our results show that unimodal classifiers achieve up to 29% presence accuracy and 13% salience accuracy on the validation set, while multimodal methods yield clear improvements, with ImageBind + WavLM reaching 35% presence accuracy and HiCMAE 18% salience accuracy. On the held-out test set, the best models achieve 33% presence accuracy (VideoMAEv2 + HuBERT) and 18% salience accuracy (HiCMAE). In sum, the BLEMORE dataset provides a valuable resource for advancing research on emotion recognition systems that account for the complexity and significance of blended emotion expressions.
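To make the two prediction tasks concrete, the following is a minimal sketch of how presence and salience accuracy could be scored. It is an illustration under stated assumptions, not the evaluation protocol of the paper: the abstract does not define the metrics, so the matching rules, the tolerance `tol`, the function names, and the emotion names used as dictionary keys are all hypothetical.

```python
from typing import Dict, List

# A sample label: emotion name -> relative salience, summing to 1.0.
# Example (hypothetical emotion names): {"anger": 0.7, "fear": 0.3}
Blend = Dict[str, float]


def presence_correct(pred: Blend, true: Blend) -> bool:
    """Assumed rule: the predicted set of emotions equals the true set."""
    return set(pred) == set(true)


def salience_correct(pred: Blend, true: Blend, tol: float = 0.1) -> bool:
    """Assumed rule: the same emotions are present and every predicted
    proportion lies strictly within `tol` of the annotated proportion.
    The annotated configurations (50/50, 70/30, 30/70) are 0.2 apart,
    so tol=0.1 cannot match two configurations at once."""
    if not presence_correct(pred, true):
        return False
    return all(abs(pred[e] - true[e]) < tol for e in true)


def accuracy(preds: List[Blend], trues: List[Blend], rule) -> float:
    """Fraction of samples for which `rule(pred, true)` holds."""
    if len(preds) != len(trues) or not trues:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for p, t in zip(preds, trues) if rule(p, t))
    return hits / len(trues)
```

Under these assumed rules, salience accuracy can never exceed presence accuracy, because a salience match requires the correct set of emotions first. This is consistent with the ordering of the reported numbers, although the paper may define both metrics differently.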
