This paper introduces a novel approach to emotion detection in speech using Large Language Models (LLMs). To address the inability of text-only LLMs to process audio inputs directly, we translate speech characteristics into natural language descriptions. Our method integrates these descriptions into text prompts, enabling LLMs to perform multimodal emotion analysis without architectural modifications. We evaluate our approach on two datasets, IEMOCAP and MELD, and demonstrate significant improvements in emotion recognition accuracy, particularly on high-quality audio data. Our experiments show that incorporating speech descriptions yields an improvement of roughly 2.5 percentage points in weighted F1 score on IEMOCAP (from 70.111\% to 72.596\%). We also compare various LLM architectures and explore the effectiveness of different feature representations. Our findings highlight the potential of this approach for enhancing the emotion detection capabilities of LLMs and underscore the importance of audio quality in speech-based emotion recognition tasks. We will release the source code on GitHub.
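To make the idea concrete, the following is a minimal sketch of the description-to-prompt pipeline, not the paper's implementation: it assumes librosa-derived pitch, energy, and speaking-rate features, and the thresholds and wording of the verbal descriptions are hypothetical.

\begin{verbatim}
import numpy as np
import librosa

def describe_speech(wav_path):
    """Translate low-level speech characteristics into a natural
    language description. Feature choices and thresholds here are
    illustrative assumptions, not the paper's exact descriptors."""
    y, sr = librosa.load(wav_path, sr=16000)

    # Fundamental frequency (pitch) via pYIN; may contain NaNs for
    # unvoiced frames, so average only over voiced ones.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'),
        fmax=librosa.note_to_hz('C7'), sr=sr)
    mean_f0 = np.nanmean(f0) if np.any(~np.isnan(f0)) else 0.0

    # RMS energy as a loudness proxy.
    rms = librosa.feature.rms(y=y).mean()

    # Speaking-rate proxy: acoustic onsets per second.
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    rate = len(onsets) / (len(y) / sr)

    # Hypothetical thresholds mapping features to words.
    pitch_desc = "high-pitched" if mean_f0 > 200 else "low-pitched"
    energy_desc = "loud" if rms > 0.05 else "soft"
    rate_desc = "fast" if rate > 3.0 else "slow"
    return (f"The speaker sounds {pitch_desc} and {energy_desc}, "
            f"speaking at a {rate_desc} pace.")

def build_prompt(transcript, wav_path):
    """Inject the audio description into an otherwise text-only
    prompt, so a standard LLM can use acoustic cues without any
    architectural modification."""
    desc = describe_speech(wav_path)
    return (f'Utterance: "{transcript}"\n'
            f"Speech characteristics: {desc}\n"
            f"Classify the speaker's emotion "
            f"(e.g., angry, happy, sad, neutral).")
\end{verbatim}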