Speech emotion recognition (SER) is crucial to speech understanding and generation. Most existing approaches rely on either classification models or large language models. Unlike these methods, we propose Gen-SER, a novel approach that reformulates SER as a distribution shift problem solved with generative models. We project discrete class labels into a continuous space and obtain the terminal distribution via sinusoidal taxonomy encoding. A target-matching-based generative model then efficiently transforms the initial distribution into the terminal distribution. Classification is performed by computing the similarity between the generated terminal distribution and the ground-truth terminal distribution of each class. Experimental results confirm the efficacy of the proposed method, demonstrate its extensibility to various speech-understanding tasks, and suggest its potential applicability to a broader range of classification tasks.
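To make the label-encoding and similarity-based classification steps concrete, the following is a minimal sketch under stated assumptions, not the Gen-SER implementation. The abstract does not specify the form of the sinusoidal taxonomy encoding, so the hypothetical `sinusoidal_encode` below borrows the Transformer-style positional encoding as a stand-in. The encoding dimension, the label set, and the use of cosine similarity are likewise assumptions, and the generative model is replaced by a noisy copy of a class code purely for illustration.

```python
import numpy as np

def sinusoidal_encode(index, dim=16, base=10000.0):
    """Map an integer class index to a dim-dimensional sinusoidal vector.

    Assumption: a Transformer-style positional encoding stands in for the
    paper's sinusoidal taxonomy encoding, whose exact form is not given.
    """
    i = np.arange(dim // 2)
    freqs = 1.0 / base ** (2 * i / dim)   # geometrically spaced frequencies
    angles = index * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def classify(generated, class_codes):
    """Return the index of the class code most cosine-similar to `generated`."""
    g = generated / np.linalg.norm(generated)
    c = class_codes / np.linalg.norm(class_codes, axis=1, keepdims=True)
    return int(np.argmax(c @ g))

# Hypothetical label set; the abstract does not list the emotion classes.
labels = ["angry", "happy", "neutral", "sad"]
codes = np.stack([sinusoidal_encode(k) for k in range(len(labels))])

# Stand-in for the generative model's output: the "neutral" code plus
# small Gaussian noise (assumption, for illustration only).
rng = np.random.default_rng(0)
noisy = codes[2] + 0.05 * rng.standard_normal(codes.shape[1])
predicted = labels[classify(noisy, codes)]
```

In this sketch every class code has the same norm, so the nearest code under cosine similarity is also the nearest under Euclidean distance; a real system would compare the model's generated output against the codes in the same way.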