Autoregressive models based on Transformers have become the prevailing approach for generating music compositions that exhibit comprehensive musical structure. These models are typically trained by minimizing the negative log-likelihood (NLL) of the observed sequence in an autoregressive manner. However, when generating long sequences, the quality of samples from these models deteriorates significantly due to exposure bias. Classifiers trained to distinguish real sequences from sampled ones can detect these failures, and this observation motivates our exploration of adversarial losses as a complement to the NLL objective. We use a pre-trained Span-BERT model as the discriminator in a Generative Adversarial Network (GAN) framework, which improves training stability in our experiments. To optimize over discrete sequences within the GAN framework, we use the Gumbel-Softmax trick to obtain a differentiable approximation of the sampling process. In addition, we split the sequences into smaller chunks to stay within memory constraints. Through human evaluations and a novel discriminative metric, we demonstrate that our approach outperforms a baseline model trained solely on likelihood maximization.
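As an illustration of the Gumbel-Softmax trick mentioned above, the following is a minimal sketch for a single categorical distribution over tokens. It is not the implementation used in this work: the function name `gumbel_softmax_sample`, the `temperature` and `rng` parameters, and the use of plain Python instead of an autodiff framework are all assumptions made for the example. In an actual training setup the same computation would run on tensors, so that gradients from the discriminator can flow through the relaxed sample back to the generator.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample from a categorical distribution.

    `logits` are unnormalized log-probabilities. Hypothetical helper for
    illustration only; names and defaults are assumptions.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        u = rng.random()
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        # Perturb the logit, then scale by the temperature.
        noisy.append((logit + gumbel) / temperature)

    # Numerically stable softmax over the perturbed, scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

Taking the argmax of the perturbed logits would give an exact but non-differentiable categorical sample. The softmax replaces that argmax with a smooth approximation, and lowering `temperature` pushes the output closer to a one-hot vector at the cost of higher-variance gradients.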