Latent variable disentanglement aims to infer the multiple informative latent representations that underlie a data generation process, and it is a key factor in controllable data generation. In this paper, we propose a deep neural network-based self-supervised learning method to infer the disentangled rhythmic and harmonic representations behind music audio generation. We train a variational autoencoder that generates an audio mel-spectrogram from two latent features, one representing the rhythmic content and the other the harmonic content. During training, the variational autoencoder learns to reconstruct the input mel-spectrogram given its pitch-shifted version. At each forward pass during training, a vector rotation operation is applied to one of the latent features, under the assumption that the dimensions of the feature vector correspond to pitch intervals. As a result, in the trained variational autoencoder, the rotated latent feature represents the pitch-related information of the mel-spectrogram, and the unrotated latent feature represents the pitch-invariant information, i.e., the rhythmic content. We evaluate the proposed method by applying a predictor-based disentanglement metric to the learned features. Furthermore, we demonstrate its application to the automatic generation of music remixes.
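To make the latent rotation step more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the "vector rotation" can be modeled as a circular shift of the harmonic latent vector by the number of semitones of the pitch shift, and that the harmonic latent has 12 dimensions. The function names, the latent sizes, and the direction of the shift are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def rotate_latent(z, shift):
    """Circularly shift the last axis of z by `shift` positions.

    Assumption: a circular shift stands in for the abstract's
    "vector rotation operation"; the actual operator may differ.
    """
    return np.roll(z, shift, axis=-1)


def apply_pitch_rotation(z_rhythm, z_harmony, semitones):
    """Rotate only the harmonic latent; leave the rhythmic latent untouched."""
    return z_rhythm, rotate_latent(z_harmony, semitones)


rng = np.random.default_rng(0)
z_rhythm = rng.normal(size=8)    # hypothetical size of the rhythmic latent
z_harmony = rng.normal(size=12)  # hypothetical: one dimension per semitone step

# Latents that would come from encoding a pitch-shifted input are adjusted
# here by rotating the harmonic latent by the (assumed known) shift amount.
r_out, h_out = apply_pitch_rotation(z_rhythm, z_harmony, semitones=3)

# Rotating back by the opposite amount recovers the original harmonic latent.
h_back = rotate_latent(h_out, -3)
```

Because only one latent is rotated, a decoder trained under this scheme can explain the pitch shift solely through that latent, which is the intuition behind calling the unrotated latent pitch-invariant.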