In this paper, we study different approaches for classifying emotions from speech using acoustic and text-based features. We propose obtaining contextualized word embeddings with BERT to represent the information contained in speech transcriptions and show that this yields better performance than using GloVe embeddings. We also propose and compare different strategies for combining the audio and text modalities, evaluating them on the IEMOCAP and MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is beneficial on both datasets, though only subtle differences are observed across the evaluated fusion approaches. Finally, for IEMOCAP, we show that the criteria used to define the cross-validation folds have a large effect on results. In particular, the standard way of creating folds for this dataset results in a highly optimistic estimate of performance for the text-based system, suggesting that some previous works may overestimate the advantage of incorporating transcriptions.