This paper highlights the critical importance of multi-channel speech enhancement (MCSE) for speech emotion recognition (ER) in cocktail party scenarios. A multi-channel speech dereverberation and separation front-end integrating DNN-WPE and mask-based MVDR beamforming extracts the target speaker's speech from the mixture before it is fed into the downstream ER back-end built on HuBERT-based speech and ViT-based visual features. Experiments on mixture speech constructed from the IEMOCAP and MSP-FACE datasets suggest that the MCSE output consistently outperforms domain fine-tuned single-channel speech representations produced by: a) Conformer-based metric GANs; and b) WavLM SSL features with optional SE-ER dual-task fine-tuning. Statistically significant gains in weighted accuracy, unweighted accuracy and F1 of up to 9.5%, 8.5% and 9.1% absolute (17.1%, 14.7% and 16.0% relative) are obtained over the above single-channel baselines. The generalization of the IEMOCAP-trained MCSE front-end is also demonstrated when it is applied zero-shot to out-of-domain MSP-FACE data.
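The mask-based MVDR stage of the front-end can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: `mvdr_weights` is a hypothetical helper name, the input STFT is assumed to be already dereverberated by a preceding DNN-WPE stage (not shown), and the time-frequency masks are assumed to come from a separately trained DNN. It uses the common Souden-style formulation with a reference channel.

```python
import numpy as np

def mvdr_weights(spec, speech_mask, noise_mask, ref_ch=0):
    """Mask-based MVDR beamformer weights per frequency bin (a sketch).

    spec: (C, T, F) complex multi-channel STFT, assumed already
          dereverberated by a DNN-WPE stage (not shown here).
    speech_mask, noise_mask: (T, F) DNN-estimated masks in [0, 1].
    Returns w: (F, C) complex weights; the enhanced output is
    y[t, f] = w[f].conj() @ spec[:, t, f].
    """
    C, T, F = spec.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        X = spec[:, :, f]  # (C, T) channels x frames at this bin
        # Mask-weighted spatial covariance matrices of speech and noise
        Phi_s = (speech_mask[:, f] * X) @ X.conj().T / max(speech_mask[:, f].sum(), 1e-6)
        Phi_n = (noise_mask[:, f] * X) @ X.conj().T / max(noise_mask[:, f].sum(), 1e-6)
        Phi_n = Phi_n + 1e-6 * np.eye(C)  # diagonal loading for numerical stability
        # Souden MVDR: w = Phi_n^{-1} Phi_s u_ref / trace(Phi_n^{-1} Phi_s)
        num = np.linalg.solve(Phi_n, Phi_s)
        w[f] = num[:, ref_ch] / max(np.trace(num).real, 1e-6)
    return w
```

The extracted single-channel signal produced by applying these weights is what the ER back-end then consumes in place of the noisy mixture.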