Although automatic emotion recognition (AER) has recently drawn significant research interest, most current AER studies rely on manually segmented utterances, which are usually unavailable to dialogue systems. This paper proposes integrating AER with automatic speech recognition (ASR) and speaker diarisation (SD) in a jointly trained system. On top of a shared encoder, distinct output layers are built for four sub-tasks: AER, ASR, voice activity detection, and speaker classification. Taking the audio of a conversation as input, the integrated system finds all speech segments and transcribes the corresponding emotion classes, word sequences, and speaker identities. Two metrics, based on time-weighted emotion and speaker classification errors, are proposed to evaluate AER performance under automatic segmentation. Results on the IEMOCAP dataset show that the proposed system consistently outperforms two baselines built from separately trained single-task systems on AER, ASR, and SD.
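To make the idea of a time-weighted emotion classification error concrete, the following is a minimal sketch, not the paper's definition: the abstract does not give the formulas, so the scoring rule here is an assumption modelled on the familiar diarisation error rate. It sums the durations of missed speech, false-alarm speech, and speech with the wrong emotion label, then divides by the total reference speech time. The names `Segment`, `label_at`, and `time_weighted_emotion_error` are hypothetical, and segments within each list are assumed not to overlap.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (start_sec, end_sec, emotion_label).
Segment = Tuple[float, float, str]


def label_at(segments: List[Segment], t: float) -> Optional[str]:
    """Return the label of the segment covering time t, or None for non-speech."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return None


def time_weighted_emotion_error(reference: List[Segment],
                                hypothesis: List[Segment]) -> float:
    """Assumed DER-style score: (missed + false alarm + confusion) / reference time."""
    # Cut the timeline at every segment boundary so that both the reference
    # and the hypothesis label are constant inside each resulting interval.
    bounds = sorted({t for s, e, _ in reference + hypothesis for t in (s, e)})

    missed = false_alarm = confusion = total = 0.0
    for left, right in zip(bounds, bounds[1:]):
        duration = right - left
        midpoint = (left + right) / 2.0
        ref = label_at(reference, midpoint)
        hyp = label_at(hypothesis, midpoint)
        if ref is not None:
            total += duration
            if hyp is None:
                missed += duration       # reference speech the system did not find
            elif hyp != ref:
                confusion += duration    # speech found, but wrong emotion class
        elif hyp is not None:
            false_alarm += duration      # system output where the reference has no speech

    if total == 0.0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


if __name__ == "__main__":
    reference = [(0.0, 4.0, "happy"), (5.0, 9.0, "sad")]
    hypothesis = [(0.0, 3.0, "happy"), (3.0, 4.0, "neutral"), (4.5, 9.0, "sad")]
    # 1.0 s of emotion confusion + 0.5 s of false alarm over 8.0 s of reference speech.
    print(time_weighted_emotion_error(reference, hypothesis))  # 0.1875
```

Weighting errors by duration rather than counting utterances means the score remains defined when the automatic segmentation produces a different number of segments from the manual reference, which is the situation the abstract describes. A speaker-aware variant could, under the same assumptions, compare (emotion, speaker) pairs instead of emotion labels alone.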