Speech Emotion Recognition models typically rely on single categorical labels, overlooking the inherent ambiguity of human emotions. Ambiguous Emotion Recognition addresses this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether Large Audio-Language Models (ALMs) can mitigate this annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework that leverages ALMs to create Synthetic Perceptual Proxies, which augment human annotations to improve the reliability of ground-truth distributions. We validate these proxies through statistical analysis of their alignment with human-derived distributions and evaluate their impact by fine-tuning ALMs on the augmented emotion distributions. Furthermore, to address class imbalance and enable unbiased evaluation, we propose DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy. Experiments on IEMOCAP and MSP-Podcast show that synthetic annotations improve the quality of emotion distributions, especially in low-ambiguity regions where annotation agreement is high; the benefits diminish, however, for highly ambiguous emotions with greater human disagreement. This work provides the first evidence that ALMs can help address annotation scarcity in ambiguous emotion recognition, while highlighting the need for more advanced prompting or generation strategies to handle highly ambiguous cases.
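As a minimal illustration of the core idea, the sketch below shows one plausible way to pool sparse human votes with ALM-generated synthetic proxy votes into an augmented ground-truth emotion distribution, and to check the proxies' alignment with the human distribution via a divergence statistic. The label set, the smoothing constant, the synthetic weighting factor, and the use of Jensen-Shannon distance are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # hypothetical label set

def to_distribution(counts, alpha=0.0):
    """Normalize annotation counts into an emotion probability distribution.
    `alpha` is an optional additive-smoothing constant (assumption)."""
    counts = np.asarray(counts, dtype=float) + alpha
    return counts / counts.sum()

def augment_distribution(human_counts, synthetic_counts, weight=1.0):
    """Pool human votes with ALM-generated synthetic proxy votes.
    `weight` down- or up-weights the synthetic annotations (assumption)."""
    pooled = np.asarray(human_counts, float) + weight * np.asarray(synthetic_counts, float)
    return to_distribution(pooled)

# Example: 3 human votes vs. 5 synthetic ALM votes for one utterance.
human = [2, 0, 1, 0]       # two "angry", one "neutral"
synthetic = [3, 0, 2, 0]   # ALM proxies broadly agree with the humans

p_human = to_distribution(human)
p_aug = augment_distribution(human, synthetic)

# Alignment check between the synthetic proxies and the human distribution
# (Jensen-Shannon distance is one possible statistic; the paper's exact
# analysis may differ).
js = jensenshannon(to_distribution(synthetic), p_human)
print("human:", p_human, "augmented:", p_aug, "JS distance:", js)
```

In this toy case the synthetic proxies sharpen the human distribution toward the majority label; in high-disagreement (high-ambiguity) cases the same pooling could instead wash out genuine label uncertainty, which is consistent with the abstract's observation that the benefits diminish for highly ambiguous emotions.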