Autoregressive models based on Transformers have become the prevailing approach for generating music compositions with coherent long-range structure. These models are typically trained autoregressively by minimizing the negative log-likelihood (NLL) of the observed sequence. However, when generating long sequences, the quality of samples from these models tends to deteriorate significantly due to exposure bias. Classifiers trained to distinguish real from sampled sequences can identify these failures, an observation that motivates our exploration of adversarial losses as a complement to the NLL objective. We use a pre-trained Span-BERT model as the discriminator in the Generative Adversarial Network (GAN) framework, a choice that improved training stability in our experiments. To optimize over discrete sequences within the GAN framework, we use the Gumbel-Softmax trick to obtain a differentiable approximation of the sampling process. Additionally, we partition the sequences into smaller chunks to keep memory usage within limits. Through human evaluations and a novel discriminative metric, we demonstrate that our approach outperforms a baseline model trained solely by likelihood maximization.
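
As an illustration of the Gumbel-Softmax trick mentioned above, the following is a minimal sketch in pure Python, not the authors' implementation. The function name `gumbel_softmax`, the example logits, and the temperature values are assumptions chosen for illustration. It draws a relaxed one-hot sample by adding Gumbel(0, 1) noise to the logits and applying a temperature-scaled softmax; in an actual training loop this would be computed in an automatic-differentiation framework so that gradients can flow from the discriminator back to the generator.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample: softmax((logits + g) / temperature).

    `logits` are unnormalized log-probabilities over a (hypothetical) token
    vocabulary, `temperature` > 0 controls how close the sample is to one-hot,
    and `rng` is a random.Random instance used to draw the Gumbel noise.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        u = rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u away from 0 and 1
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / temperature)

    # Numerically stable softmax over the perturbed, scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits over a three-token vocabulary (assumed values).
logits = [2.0, 0.5, -1.0]

# Reusing the seed reuses the same Gumbel noise, so only the temperature differs.
soft = gumbel_softmax(logits, temperature=1.0, rng=random.Random(0))
sharp = gumbel_softmax(logits, temperature=0.1, rng=random.Random(0))
```

Both outputs are probability vectors over the vocabulary. With the noise held fixed, lowering the temperature concentrates the mass on the same winning token, moving the sample closer to a discrete one-hot choice while remaining a smooth function of the logits.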