Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, which makes it difficult to scale these methods to large speech datasets and nuanced emotion taxonomies. We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. To infer weak labels constrained to a taxonomy, we use a textual entailment approach that selects the emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and show improved label efficiency. Although the models are pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech.
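The label-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the hypothesis template, the function names, and the keyword-based `toy_entailment_score` are all assumptions introduced for illustration. In practice the scorer would be replaced by an entailment model that returns the probability that the transcript entails the hypothesis.

```python
from typing import Callable, Sequence

# Hypothetical hypothesis template; the abstract does not specify the wording.
HYPOTHESIS_TEMPLATE = "The speaker is expressing {label}."


def infer_weak_label(
    transcript: str,
    taxonomy: Sequence[str],
    entailment_score: Callable[[str, str], float],
) -> str:
    """Return the taxonomy label whose hypothesis scores highest for the transcript."""
    if not taxonomy:
        raise ValueError("taxonomy must contain at least one label")
    # Score one hypothesis per candidate label, with the transcript as premise.
    scores = {
        label: entailment_score(transcript, HYPOTHESIS_TEMPLATE.format(label=label))
        for label in taxonomy
    }
    # On ties, the label listed first in the taxonomy wins.
    return max(scores, key=scores.get)


# Toy stand-in for an entailment model (an assumption, for demonstration only):
# it counts cue words in the premise for whichever label the hypothesis names.
TOY_CUES = {
    "joy": {"great", "wonderful"},
    "anger": {"furious", "hate"},
    "sadness": {"miss", "alone"},
}


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    words = set(premise.lower().replace(".", " ").replace(",", " ").split())
    score = 0.0
    for label, cues in TOY_CUES.items():
        if label in hypothesis:
            score += len(words & cues)
    return score


if __name__ == "__main__":
    taxonomy = ["joy", "anger", "sadness"]
    print(infer_weak_label("I am furious, I hate this.", taxonomy, toy_entailment_score))
```

Passing the scorer as an argument keeps the selection logic independent of any particular entailment model, so the same function works with the toy scorer here or with a learned one.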