Deep learning models are driving significant advances in speech emotion recognition (SER). Nonetheless, training SER systems remains challenging and demands both time and costly resources. As with many other machine learning tasks, building SER datasets requires substantial annotation effort, including transcription and labeling, and this effort makes conventional SER systems difficult to scale. Recent developments in foundational models have had a tremendous impact, giving rise to applications such as ChatGPT. These models have enhanced human-computer interaction and open up new possibilities for streamlining data collection in fields like SER. In this research, we explore the use of foundational models to help automate the SER pipeline, from transcription and annotation to augmentation. Our study demonstrates that these models can generate transcriptions that enhance the performance of SER systems that otherwise rely solely on speech data. We also find that annotating emotions from transcribed speech remains a challenging task, but that combining outputs from multiple LLMs improves annotation quality. Lastly, our findings suggest that it is feasible to augment existing speech emotion datasets by annotating unlabeled speech samples.
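As an illustration of what combining outputs from multiple LLMs could look like, the sketch below uses a simple majority vote over per-utterance emotion labels. This is an assumption made for illustration: the abstract does not state how the outputs are combined, and the function name `majority_vote`, the `fallback` label, and the example labels are all hypothetical.

```python
from collections import Counter


def majority_vote(labels, fallback="neutral"):
    """Return the most frequent label, or `fallback` on a tie or empty input.

    Hypothetical helper: majority voting is one common way to merge several
    annotators' labels, not necessarily the method used in the paper.
    """
    counts = Counter(labels).most_common()
    if not counts:
        return fallback
    # A tie for first place means the annotators did not reach a majority.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]


# Hypothetical labels from three LLM annotators for one transcribed utterance.
votes = ["happy", "happy", "neutral"]
print(majority_vote(votes))
```

Returning a fallback label on ties is only one possible design choice; discarding tied samples or deferring them to a human annotator would be equally reasonable under this sketch.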