Music generative artificial intelligence (AI) is rapidly expanding the volume of music content, making automated song aesthetics evaluation a necessity. However, existing studies largely focus on speech, audio, or singing quality, leaving song aesthetics underexplored. Moreover, conventional approaches often predict a precise Mean Opinion Score (MOS) value directly, which struggles to capture the nuances of human perception in song aesthetics evaluation. This paper proposes a song-oriented aesthetics evaluation framework featuring two novel modules: 1) Multi-Stem Attention Fusion (MSAF) builds bidirectional cross-attention between mixture-vocal and mixture-accompaniment pairs and fuses them to capture complex musical features; 2) Hierarchical Granularity-Aware Interval Aggregation (HiGIA) learns multi-granularity score probability distributions, aggregates them into a score interval, and applies a regression within that interval to produce the final score. We evaluate the framework on two datasets of full-length songs, the SongEval dataset (AI-generated) and an internal aesthetics dataset (human-created), and compare it with two state-of-the-art (SOTA) models. Results show that the proposed method achieves stronger performance in multi-dimensional song aesthetics evaluation.
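To make the MSAF idea concrete, the following is a minimal pure-Python sketch of cross-attention between a mixture sequence and its stems. The abstract gives no equations, so single-head scaled dot-product attention and mean fusion are placeholder assumptions (as are the function names `attend` and `msaf_fuse`), not the authors' actual design; only the mixture-side attention direction of the bidirectional scheme is shown for brevity.

```python
import math

def attend(queries, keys_values):
    """Single-head scaled dot-product attention over pure-Python lists of
    vectors; keys_values serves as both keys and values (an assumption)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        peak = max(scores)                       # numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[d] for w, v in zip(weights, keys_values))
                    for d in range(dim)])
    return out

def msaf_fuse(mixture, vocal, accompaniment):
    """Cross-attend the mixture over the vocal and accompaniment stems,
    then fuse the views with a simple mean (the fusion rule is assumed)."""
    mix_from_vocal = attend(mixture, vocal)
    mix_from_accomp = attend(mixture, accompaniment)
    return [[(a + b + c) / 3 for a, b, c in zip(x, y, z)]
            for x, y, z in zip(mixture, mix_from_vocal, mix_from_accomp)]
```

In a real model these operations would run on learned projections of frame-level embeddings (e.g. with `torch.nn.MultiheadAttention`); the sketch only illustrates the data flow.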
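The HiGIA step, reading the abstract literally (distribution over fine score bins, aggregation into a coarse interval, regression inside it), can be sketched as follows. The bin layout, the argmax interval selection, and the expectation-style in-interval "regression" are all assumptions, since the abstract does not spell out the exact mechanism.

```python
def higia_score(fine_probs, fine_centers, bins_per_interval=4):
    """Aggregate fine-grained score-bin probabilities into coarse intervals,
    select the most probable interval, then take the probability-weighted
    mean score within it as the final prediction (assumed interpretation)."""
    # Coarse-granularity aggregation: sum the fine-bin mass per interval.
    intervals = [(i, sum(fine_probs[i:i + bins_per_interval]))
                 for i in range(0, len(fine_probs), bins_per_interval)]
    start, mass = max(intervals, key=lambda iv: iv[1])
    # In-interval regression: expectation under the renormalized distribution.
    window = slice(start, start + bins_per_interval)
    return sum(p * c
               for p, c in zip(fine_probs[window], fine_centers[window])) / mass

# 16 fine bins spanning the MOS range [1, 5], four per unit-width interval.
centers = [1 + (k + 0.5) * 0.25 for k in range(16)]
probs = [0.0] * 8 + [0.1, 0.4, 0.4, 0.1] + [0.0] * 4   # mass in [3, 4)
print(round(higia_score(probs, centers), 3))  # → 3.5
```

Restricting the regression to one interval keeps the prediction inside the most plausible score range, rather than letting mass in distant bins pull a global expectation toward the mean.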