Whose Emotion Matters? Speaking Activity Localisation without Prior Knowledge

The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as provided, for example, in the video-based Multimodal EmotionLines Dataset (MELD). However, only a few research approaches use both acoustic and visual information from the MELD videos. There are two reasons for this. First, label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data. Second, conversations can involve several people in the same scene, which requires localising the source of each utterance. In this paper, we introduce MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR). Using recent active speaker detection and automatic speech recognition models, we are able to realign the videos of MELD and capture the facial expressions of the speakers in 96.92% of the utterances provided in MELD. Experiments with a self-supervised voice recognition model indicate that the realigned MELD-FAIR videos match the transcribed utterances given in the MELD dataset more closely than the original videos do. Finally, we devise a model for emotion recognition in conversations trained on the realigned MELD-FAIR videos, which outperforms state-of-the-art models for ERC based on vision alone. This indicates that localising the source of speaking activity is effective for extracting the facial expressions of the uttering speakers, and that faces provide more informative visual cues than the visual features that state-of-the-art models have used so far. The MELD-FAIR realignment data and the code for the realignment procedure and for emotion recognition are available at https://github.com/knowledgetechnologyuhh/MELD-FAIR.
