The Music Emotion Recognition (MER) field has seen steady progress in recent years, with contributions from feature engineering, machine learning, and deep learning. The landscape has also shifted from audio-centric systems to bimodal ensembles that combine audio and lyrics. However, the severe lack of public, sizeable bimodal databases has hampered the development and improvement of bimodal audio-lyrics systems. This article proposes three new MER research datasets (audio, lyrics, and bimodal), collectively called MERGE, created using a semi-automatic approach. To comprehensively assess the proposed datasets and establish a baseline for benchmarking, we conducted several experiments for each modality, using feature engineering, machine learning, and deep learning methodologies. In addition, we propose and validate fixed train-validation-test splits. The results confirm the viability of the proposed datasets; the best overall result, a 79.21% F1-score for bimodal classification, was obtained with a deep neural network.