We present SoundLoCD, a novel text-to-sound generation framework built on a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, SoundLoCD can be trained efficiently with limited computational resources. A contrastive learning strategy further strengthens the connection between the text condition and the generated output, yielding coherent, high-fidelity results. Our experiments show that SoundLoCD outperforms the baseline while requiring far fewer computational resources. A comprehensive ablation study further validates the contribution of each component of SoundLoCD. Demo page: \url{https://XinleiNIU.github.io/demo-SoundLoCD/}.