Speech emotion recognition is an important component of any human-centered system. However, the speech characteristics that a person produces and perceives can be influenced by many factors, both desirable, such as emotion, and undesirable, such as noise. Training robust emotion recognition models requires a large yet realistic data distribution, but emotion datasets are often small and are therefore augmented with noise. Noise augmentation typically rests on one important assumption: that the prediction label remains the same whether or not noise is present. This holds for automatic speech recognition but not necessarily for perception-based tasks. In this paper, we make three novel contributions. First, we validate through crowdsourcing that the presence of noise does change the annotation label and hence may alter the original ground-truth label. Second, we show how disregarding this finding and assuming consistent ground-truth labels propagates to the downstream evaluation of ML models, both for performance evaluation and for robustness testing. Finally, we provide a set of recommendations for noise augmentation in speech emotion recognition datasets.