Deep neural networks have recently achieved breakthroughs in sound generation from text prompts. Despite their promising performance, current text-to-sound generation models are prone to problems such as overfitting on small-scale datasets, which significantly limits their performance. In this paper, we investigate the use of pre-trained AudioLDM, the state-of-the-art model for text-to-audio generation, as the backbone for sound generation. Our study demonstrates the advantages of using pre-trained models for text-to-sound generation, especially in data-scarce scenarios. In addition, experiments show that different training strategies (e.g., training conditions) may affect the performance of AudioLDM on datasets of different scales. To facilitate future studies, we also evaluate various text-to-sound generation systems on several frequently used datasets under the same evaluation protocols, which allows fair comparison and benchmarking of these methods on common ground.