Emotion recognition from speech and music shares similarities due to their acoustic overlap, which has motivated interest in transferring knowledge between the two domains. However, the shared acoustic cues between speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, since speech and music SSL models have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotional speech and music, starting with an analysis of the layerwise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). We then perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Finally, we probe the acoustic similarity between emotional speech and music using the Fréchet audio distance computed per emotion, uncovering an emotion-bias issue in both speech and music SSL models. Our findings reveal that while speech and music SSL models do capture shared acoustic features, their behavior varies across emotions owing to their training strategies and domain specificity. Additionally, parameter-efficient fine-tuning can improve SER and MER performance by transferring knowledge between the two domains. This study provides new insights into the acoustic similarity between emotional speech and music and highlights the potential of cross-domain generalization to improve SER and MER systems.
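As a minimal sketch of the per-emotion similarity measurement described above: the Fréchet audio distance fits a Gaussian to each set of model embeddings (e.g., SSL features for one emotion's speech clips vs. that emotion's music clips) and computes the Fréchet distance between the two Gaussians. The function below is an illustrative, generic implementation of that standard formula, not the paper's exact pipeline; the embedding arrays are hypothetical placeholders.

```python
import numpy as np
from scipy import linalg


def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets.

    emb_a, emb_b: arrays of shape (num_clips, embedding_dim), e.g. pooled
    SSL-layer features for emotional speech and emotional music clips.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    diff = mu_a - mu_b
    # Matrix square root of the covariance product; small imaginary
    # components from numerical error are discarded.
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))


# Illustrative usage on synthetic "embeddings": shifting a distribution by 1
# in every dimension yields a distance close to embedding_dim (here 8).
rng = np.random.default_rng(0)
speech_emb = rng.normal(size=(500, 8))   # stand-in for speech SSL features
music_emb = speech_emb + 1.0             # stand-in for music SSL features
fad = frechet_audio_distance(speech_emb, music_emb)
```

Computing this distance separately for each emotion category, as in the study, makes any emotion-dependent asymmetry in the learned representations directly visible as a spread in the per-emotion distances.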