The subjective nature of music emotion introduces inherent bias into both recognition and generation, especially when relying on a single audio encoder, emotion classifier, or evaluation metric. In this work, we conduct a study of Music Emotion Recognition (MER) and Emotional Music Generation (EMG), employing diverse audio encoders together with the Fréchet Audio Distance (FAD), a reference-free evaluation metric. Our study begins with a benchmark evaluation of MER, highlighting the limitations of relying on a single audio encoder and the disparities observed across different measurements. We then propose assessing MER performance with FAD computed from multiple encoders to provide a more objective measure of music emotion. Furthermore, we introduce an enhanced EMG approach designed to increase both the variation and prominence of the generated emotion, thereby improving realism. We also investigate the gap in realism between the emotions conveyed in real and synthetic music, comparing our EMG model against two baseline models. Experimental results underscore the emotion bias problem in both MER and EMG and demonstrate the potential of FAD and diverse audio encoders for evaluating music emotion objectively.
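As a rough illustration of the reference-free evaluation idea described above, the sketch below computes FAD between two sets of audio embeddings and reports one score per encoder; the `encoders` mapping and its embedding helpers are hypothetical placeholders, not the specific encoders used in this work.

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """FAD between two embedding sets, each of shape (n_samples, dim)."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts
    # introduced by numerical error.
    cov_sqrt = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))


def multi_encoder_fad(real_audio, generated_audio, encoders) -> dict:
    """Compute FAD separately in each encoder's embedding space.

    `encoders` maps an encoder name to a function that turns a list of
    waveforms into an (n_samples, dim) embedding array (assumed helpers).
    """
    return {
        name: frechet_audio_distance(embed(real_audio), embed(generated_audio))
        for name, embed in encoders.items()
    }
```

Reporting the per-encoder scores (or their aggregate) rather than a single encoder's value is one way to mitigate the single-encoder bias the abstract points to.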