Interactive streaming music generation promises to bring generative models to live performance and co-creation, settings that offline models cannot support. However, state-of-the-art models are discrete autoregressive (AR) models that require industrial-scale compute for both training and inference. In this work, we investigate whether audio diffusion models, which are widely supported in the open-source community but are bidirectional and non-streaming by nature, can be efficiently repurposed into interactive models that run on consumer hardware. Examining the modern pipeline for block-wise outpainting diffusion, we identify inference-time inefficiencies that make it strictly less computationally efficient than its discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then improves on, the inference complexity of discrete Live Music Models (LMMs) through block-wise KV caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, which reduces error accumulation without any explicit reinforcement learning (RL) or reward models. We demonstrate LMDMs in several creative applications, including text-conditioned generation, sketch-based music synthesis, and jamming. Finally, we show how LMDMs can serve as a generative instrument in a real artist-AI collaboration: used as a "generative delay", an LMDM transforms musicians' improvisation live to produce variable timbral effects while running locally on a consumer gaming laptop.