We present a lightweight latent diffusion model for vocal-conditioned musical accompaniment generation that addresses key limitations of existing music AI systems. Our approach introduces a novel soft alignment attention mechanism that adaptively combines local and global temporal dependencies according to the diffusion timestep, enabling efficient capture of multi-scale musical structure. Operating in the compressed latent space of a pre-trained variational autoencoder, the model achieves a 220x reduction in parameter count compared to state-of-the-art systems while delivering 52x faster inference. Experimental evaluation demonstrates competitive performance with only 15M parameters: the model outperforms OpenAI Jukebox in production quality and content unity while maintaining reasonable musical coherence. The ultra-lightweight architecture enables real-time deployment on consumer hardware, making AI-assisted music creation accessible for interactive applications and resource-constrained environments.
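To make the timestep-adaptive mechanism concrete, the following is a minimal sketch of one way such a soft alignment attention block could be structured: a windowed (local) attention path and a full-sequence (global) attention path are blended by a gate conditioned on the diffusion timestep. All names (SoftAlignmentAttention, window_size, the sinusoidal timestep embedding, and the sigmoid gate) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch only; module and parameter names are assumptions,
# not the paper's released code.
import math
import torch
import torch.nn as nn


class SoftAlignmentAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, window_size: int = 16, t_dim: int = 128):
        super().__init__()
        self.window_size = window_size
        self.t_dim = t_dim
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate conditioned on the diffusion timestep: decides how much weight
        # goes to global vs. local temporal dependencies at this denoising step.
        self.gate = nn.Sequential(nn.Linear(t_dim, dim), nn.SiLU(), nn.Linear(dim, 1))

    def timestep_embedding(self, t: torch.Tensor) -> torch.Tensor:
        # Standard sinusoidal embedding of the diffusion timestep (assumed encoding).
        half = self.t_dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        args = t[:, None].float() * freqs[None]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) latent sequence; t: (batch,) diffusion timesteps.
        B, L, _ = x.shape
        # Local path: block attention outside a fixed window around each position.
        idx = torch.arange(L, device=x.device)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window_size  # True = blocked
        local_out, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        # Global path: unrestricted attention over the whole sequence.
        global_out, _ = self.global_attn(x, x, x)
        # Soft, timestep-dependent mixture of the two dependency ranges.
        alpha = torch.sigmoid(self.gate(self.timestep_embedding(t)))  # (B, 1)
        return alpha[:, None, :] * global_out + (1 - alpha[:, None, :]) * local_out
```

Under this reading, early (noisy) timesteps could lean on the global path to lay out long-range structure, while later timesteps shift toward the local path to refine fine-grained detail; the exact gating behavior in the paper may differ.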