Hit song prediction, an emerging field in music information retrieval (MIR), remains a considerable challenge. Understanding what makes a song a hit would benefit the whole music industry. Previous approaches to hit song prediction have focused mainly on the audio features of a record. This study aims to improve the prediction of top-10 hits among Billboard Hot 100 songs by using a broader set of features: song audio features provided by Spotify, song lyrics, and novel metadata-based features (title topic, popularity continuity and genre class). Five machine learning approaches are applied: k-nearest neighbours, Naive Bayes, Random Forest, Logistic Regression and Multilayer Perceptron. Our results show that Random Forest (RF) and Logistic Regression (LR) trained on all features (novel features, song audio features and lyrics features) outperform the other models, achieving 89.1% and 87.2% accuracy, and 0.91 and 0.93 AUC, respectively. Our findings also demonstrate the utility of our novel music metadata features, which contributed most to the models' discriminative performance.
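The abstract reports both accuracy and AUC, and the two metrics need not rank models the same way (RF has the higher accuracy, LR the higher AUC). The following is a minimal sketch of how the two quantities are typically computed for a binary hit/non-hit classifier. The labels, scores, and the 0.5 decision threshold are made up for illustration and are not taken from the study, and the function names `accuracy` and `roc_auc` are hypothetical helpers rather than the authors' code.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of songs whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, y_score):
    """Probability that a random hit is scored above a random non-hit.

    Ties between a hit and a non-hit count as half a win.
    """
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = top-10 hit, 0 = other Hot 100 song.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical model A: mostly confident scores, one hit ranked below a non-hit.
model_a = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
# Hypothetical model B: ranks every hit above every non-hit, but all scores < 0.5.
model_b = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10]

for name, scores in [("A", model_a), ("B", model_b)]:
    print(name, round(accuracy(labels, scores), 3), round(roc_auc(labels, scores), 3))
```

In this toy example, model B ranks the songs perfectly (higher AUC) yet has lower accuracy at the fixed threshold than model A. This shows only that the two metrics can disagree; it is not an explanation of the reported RF and LR figures.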