Emotions lie on a continuum, but current models treat emotion as a finite-valued discrete variable. This representation does not capture the diversity in how emotion is expressed. To represent emotions better, we propose using natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to learn emotion representations from audio and prompt pairs. We use acoustic properties that are correlated with emotion, such as pitch, intensity, speech rate, and articulation rate, to automatically generate prompts, i.e., 'acoustic prompts'. We use a contrastive learning objective to map each speech utterance to its corresponding acoustic prompt. We evaluate our model on Emotion Audio Retrieval (EAR) and Speech Emotion Recognition (SER). Our results show that acoustic prompts significantly improve the model's performance in EAR across various Precision@K metrics. In SER, we observe a 3.8% relative accuracy improvement on the Ravdess dataset.
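
As a rough illustration of the two ideas above, the following minimal Python sketch turns acoustic measurements into a text prompt and computes Precision@K for a retrieval result. It is not the authors' implementation: the bin thresholds, the prompt template, and the function names (`acoustic_prompt`, `precision_at_k`) are hypothetical, chosen only to make the example concrete.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical (low, high) cut-offs per acoustic property.
# The source does not specify thresholds; these values are placeholders.
BINS: Dict[str, Tuple[float, float]] = {
    "pitch": (140.0, 220.0),           # Hz
    "intensity": (55.0, 70.0),         # dB
    "speech rate": (3.0, 5.0),         # syllables per second
    "articulation rate": (4.0, 6.0),   # syllables per second
}


def level(value: float, low: float, high: float) -> str:
    """Map a measurement to a coarse 'low' / 'medium' / 'high' label."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"


def acoustic_prompt(features: Dict[str, float]) -> str:
    """Build a natural language description from binned acoustic properties."""
    parts = [f"{level(features[name], *BINS[name])} {name}" for name in BINS]
    return "speech with " + ", ".join(parts)


def precision_at_k(retrieved: Sequence[str], query_label: str, k: int) -> float:
    """Fraction of the top-k retrieved items whose label matches the query."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(label == query_label for label in top_k) / k


if __name__ == "__main__":
    measured = {
        "pitch": 250.0,
        "intensity": 60.0,
        "speech rate": 2.5,
        "articulation rate": 5.0,
    }
    print(acoustic_prompt(measured))
    print(precision_at_k(["angry", "happy", "angry", "sad"], "angry", 3))
```

In a contrastive setup, prompts like these would be paired with their source audio so that matching pairs are pulled together in the embedding space; that training step is omitted here.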