Autoregressive models based on Transformers have become the prevailing approach for generating music compositions that exhibit comprehensive musical structure. These models are typically trained by minimizing the negative log-likelihood (NLL) of the observed sequence in an autoregressive manner. However, the quality of samples from these models tends to deteriorate significantly on long sequences, a failure attributed to exposure bias. Classifiers trained to distinguish real from sampled sequences can identify these failures, an observation that motivates our exploration of adversarial losses as a complement to the NLL objective. We employ a pre-trained Span-BERT model as the discriminator in a Generative Adversarial Network (GAN) framework, which improves training stability in our experiments. To optimize over discrete sequences within the GAN framework, we use the Gumbel-Softmax trick to obtain a differentiable approximation of the sampling process. In addition, we partition the sequences into smaller chunks so that memory constraints are met. Through human evaluations and a novel discriminative metric, we demonstrate that our approach outperforms a baseline model trained solely by likelihood maximization.
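As a rough illustration of the Gumbel-Softmax relaxation mentioned above, the following minimal sketch draws a "soft" one-hot sample from a vector of token logits. It is not the paper's implementation: the function name `gumbel_softmax_sample`, the temperature parameter `tau`, and the example logits are all assumptions made for illustration. It shows only the forward sampling step in plain Python; in an actual training setup an automatic-differentiation framework would compute the same expression so that gradients from the discriminator can flow back to the generator's logits.

```python
import math
import random


def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Return a relaxed one-hot sample over the vocabulary (forward pass only).

    `logits`, `tau`, and `rng` are illustrative names, not taken from the paper.
    """
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        u = rng.random()
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a 4-token vocabulary (illustrative values only).
logits = [2.0, 0.5, 0.1, -1.0]
# Same seed for both calls, so both samples use identical Gumbel noise.
soft = gumbel_softmax_sample(logits, tau=1.0, rng=random.Random(0))
sharp = gumbel_softmax_sample(logits, tau=0.1, rng=random.Random(0))
```

With identical noise, lowering `tau` pushes the output closer to a discrete one-hot vector while leaving the most probable token unchanged; higher temperatures give smoother samples.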