In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, their usefulness for developing audio applications remains uncertain. This paper investigates that question, focusing on the task of environmental sound classification. The study analyzes the performance of two different environmental sound classification systems when data generated by text-to-audio models is used for training. Two cases are considered: a) the training dataset is augmented with data from two different text-to-audio models; and b) the training dataset consists solely of synthetic audio. In both cases, classification performance is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas classifier performance drops when training relies only on generated audio.