MusER: Musical Element-Based Regularization for Generating Symbolic Music with Emotion

Generating music with emotion is an important task in automatic music generation, in which emotion is evoked through a variety of musical elements (such as pitch and duration) that change over time and interact with one another. However, prior research on deep learning-based emotional music generation has rarely explored how different musical elements contribute to emotion, let alone how these elements can be deliberately manipulated to alter the emotion of a piece, which limits fine-grained, element-level control over emotion. To address this gap, we present a novel approach that applies musical element-based regularization in the latent space to disentangle distinct elements, investigate their roles in distinguishing emotions, and manipulate them to alter musical emotion. Specifically, we propose a VQ-VAE-based model named MusER. MusER incorporates a regularization loss that enforces a correspondence between the musical element sequences and specific dimensions of the latent variable sequences, providing a new solution for disentangling discrete sequences. Taking advantage of the disentangled latent vectors, we devise a two-level decoding strategy in which multiple decoders attend to latent vectors with different semantics to better predict the elements. By visualizing the latent space, we find that MusER yields a disentangled and interpretable latent space, and we gain insights into the contribution of distinct elements to the emotional dimensions (i.e., arousal and valence). Experimental results demonstrate that MusER outperforms state-of-the-art models for generating emotional music in both objective and subjective evaluations. In addition, we rearrange music through element transfer and attempt to alter the emotion of music by transferring emotion-distinguishable elements.

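To make the idea of tying musical elements to specific latent dimensions more concrete, the following is a minimal sketch of one possible regularizer. It is not the loss used in MusER, whose exact form the abstract does not give. The function names, the dictionary-based mapping from element to latent dimension, and the choice of a mean squared error against standardized element values are all assumptions made for illustration.

```python
import numpy as np


def standardize(values):
    """Scale a 1-D sequence to zero mean and unit variance (zeros if constant)."""
    v = np.asarray(values, dtype=float)
    std = v.std()
    if std == 0.0:
        return np.zeros_like(v)
    return (v - v.mean()) / std


def element_regularization_loss(latents, elements, dim_of_element):
    """Hypothetical regularizer, NOT the MusER loss.

    latents:        array of shape (T, D), one latent vector per time step.
    elements:       dict mapping an element name (e.g. "pitch") to a length-T sequence.
    dim_of_element: dict mapping each element name to the latent dimension
                    that is supposed to encode it (an assumed design choice).

    Returns the mean, over elements, of the squared error between the assigned
    latent dimension and the standardized element sequence.
    """
    latents = np.asarray(latents, dtype=float)
    total = 0.0
    for name, values in elements.items():
        z = latents[:, dim_of_element[name]]  # latent dimension assigned to this element
        target = standardize(values)          # element values on a comparable scale
        total += float(np.mean((z - target) ** 2))
    return total / len(elements)
```

In a training setup, a term like this would typically be added to the reconstruction and quantization losses with a weighting coefficient, so that each chosen latent dimension tracks its assigned element while the remaining dimensions stay free to encode everything else.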