This study presents a machine learning framework for assessing similarity between audio content and predicting sentiment scores. We construct a dataset of audio samples from YouTube music covers paired with the audio of the corresponding original songs, together with sentiment scores derived from user comments, which serve as proxy labels for content quality. Our approach involves extensive pre-processing: segmenting audio signals into 30-second windows and extracting high-dimensional feature representations via Mel-Frequency Cepstral Coefficients (MFCC), Chroma, Spectral Contrast, and Temporal features. Training a regression model on each of these feature sets, we predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212, respectively. Improvements over a baseline based on absolute-difference metrics are observed. These results demonstrate the potential of machine learning to capture sentiment and similarity in audio, offering an adaptable framework for AI applications in media analysis.
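The segmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the sample rate, window overlap, or padding policy, so this sketch assumes non-overlapping 30-second windows at 22.05 kHz and drops any trailing samples shorter than one full window.

```python
import numpy as np

def segment_audio(signal, sr, window_s=30):
    """Split a 1-D audio signal into non-overlapping fixed-length windows.

    Hypothetical helper illustrating the 30-second segmentation step.
    Trailing samples shorter than one window are dropped, since the
    abstract does not state how partial windows are handled.
    """
    win = window_s * sr                      # samples per window
    n = len(signal) // win                   # number of complete windows
    return signal[: n * win].reshape(n, win)

# Example: 95 s of synthetic audio at an assumed 22.05 kHz sample rate
sr = 22050
audio = np.zeros(95 * sr, dtype=np.float32)
segments = segment_audio(audio, sr)
print(segments.shape)  # (3, 661500) -- three complete 30 s windows
```

Each resulting row would then be passed to a feature extractor (e.g., MFCC, Chroma, Spectral Contrast) to produce the fixed-dimensional vectors used for regression.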