Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations

Emotion recognition in conversations is essential for advanced human-machine interaction. However, building robust and accurate emotion recognition systems for real-life use is challenging, mainly because emotion datasets collected in the wild are scarce and because the dialogue context is often not taken into account. The CEMO dataset, composed of conversations between agents and patients during emergency calls to a French call center, fills this gap. The nature of these interactions highlights the role of the emotional flow of the conversation in predicting patient emotions, as context can often make a difference in understanding actual feelings. This paper presents a multi-scale conversational context learning approach for speech emotion recognition that builds on this hypothesis. We investigated this approach on both speech transcriptions and acoustic segments. In our experiments, the method uses the information that precedes or follows the target segment. In the text domain, we tested context windows ranging from 10 to 100 tokens, as well as context at the speech-turn level, considering inputs from both the same speaker and the other speaker. According to our tests, context derived from preceding tokens contributes more to accurate prediction than context derived from following tokens. Furthermore, including the last speech turn of the same speaker in the conversation seems useful. In the acoustic domain, we conducted an in-depth analysis of the impact of the surrounding emotions on the prediction. While multi-scale conversational context learning using Transformers can enhance performance in the textual modality for emergency call recordings, incorporating acoustic context is more challenging.
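
As an illustration of the two textual context scales described above (token-level windows and speech-turn-level context), the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization in place of a Transformer tokenizer, and the names `Segment`, `token_context`, and `last_turn`, as well as the speaker labels and the sample dialogue, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    speaker: str  # hypothetical labels: "agent" or "patient"
    text: str     # transcription of one speech segment


def token_context(
    segments: List[Segment], idx: int, n_prev: int = 0, n_next: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Return (previous tokens, target tokens, next tokens) for segment idx.

    Tokens are whitespace-separated words (an assumption of this sketch).
    """
    before = [tok for s in segments[:idx] for tok in s.text.split()]
    after = [tok for s in segments[idx + 1:] for tok in s.text.split()]
    prev_tokens = before[-n_prev:] if n_prev > 0 else []
    next_tokens = after[:n_next] if n_next > 0 else []
    return prev_tokens, segments[idx].text.split(), next_tokens


def last_turn(
    segments: List[Segment], idx: int, same_speaker: bool = True
) -> Optional[Segment]:
    """Return the most recent earlier segment spoken by the same speaker
    (or by the other speaker if same_speaker is False), or None."""
    target_speaker = segments[idx].speaker
    for seg in reversed(segments[:idx]):
        if (seg.speaker == target_speaker) == same_speaker:
            return seg
    return None


# Invented example dialogue, for illustration only.
dialogue = [
    Segment("agent", "hello what is the emergency"),
    Segment("patient", "my father fell down"),
    Segment("agent", "is he breathing"),
    Segment("patient", "yes but he is in pain"),
]

# Token-level scale: 4 preceding tokens as context for the last segment.
prev_tokens, target_tokens, next_tokens = token_context(dialogue, 3, n_prev=4)

# Turn-level scale: last turn of the same speaker and of the other speaker.
same = last_turn(dialogue, 3, same_speaker=True)
other = last_turn(dialogue, 3, same_speaker=False)
```

In a setup like the one the abstract describes, the context tokens and the target tokens would then be combined into a single model input; how they are combined and encoded is not specified here, so this sketch stops at extracting the context.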
