Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust to varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low-level noise. Experiments on musical audio generation show that our CAM substantially outperforms existing autoregressive and non-autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.
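The two ideas above — noise injection into input embeddings during training, and re-injection of low-level noise during the autoregressive rollout — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`add_training_noise`, `generate`), the per-sequence uniform sampling of the noise level, and the parameters `sigma_max` and `sigma_inf` are all assumptions chosen for clarity.

```python
import numpy as np

def add_training_noise(embeddings, sigma_max=0.1, rng=None):
    """Corrupt ground-truth input embeddings with Gaussian noise during training.

    A noise level is drawn uniformly from [0, sigma_max] per sequence
    (an assumed scheme), so the model sees a range of corruption
    strengths, mimicking the varying error levels it will face at
    inference due to error accumulation.
    """
    rng = rng or np.random.default_rng(0)
    sigma = rng.uniform(0.0, sigma_max)               # per-sequence noise level
    return embeddings + rng.normal(0.0, sigma, embeddings.shape)

def generate(model_step, seed_embedding, steps, sigma_inf=0.01, rng=None):
    """Autoregressive rollout that adds low-level noise to each input.

    Re-injecting a small, fixed amount of noise (sigma_inf) keeps
    inference-time inputs inside the noisy distribution the model was
    trained on, which is the intuition behind the reduced error
    accumulation described above.
    """
    rng = rng or np.random.default_rng(1)
    seq = [np.asarray(seed_embedding, dtype=float)]
    for _ in range(steps):
        noisy_input = seq[-1] + rng.normal(0.0, sigma_inf, seq[-1].shape)
        seq.append(model_step(noisy_input))           # predict next embedding
    return np.stack(seq)                              # (steps + 1, dim)
```

For example, with a toy `model_step` that simply damps its input, `generate(lambda e: 0.9 * e, np.ones(8), steps=5)` returns an array of shape `(6, 8)`: the seed embedding plus five generated continuations.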