Text-based audio generation models are limited because text cannot capture all the information in audio, so controllability is restricted when generation relies on text alone. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions, namely content (timestamp) and style (pitch contour and energy contour), as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of the generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder, enhanced by a large language model, and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Because suitable datasets and evaluation metrics are lacking, we consolidate existing datasets into a new dataset comprising audio and the corresponding conditions, and we use a series of evaluation metrics to assess controllability. Experimental results demonstrate that our model achieves fine-grained control and thereby controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
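To make the two kinds of control condition concrete, the following is a minimal sketch of how a timestamp condition and an energy contour might be represented at frame level. It is an illustration under stated assumptions, not the method used in this work: the function names `energy_contour` and `timestamp_mask`, the RMS definition of energy, the non-overlapping 100 ms frames, and the 0/1 frame mask are all hypothetical choices. A pitch contour would additionally require a pitch tracker and is omitted here.

```python
import math

def energy_contour(samples, frame_len, hop):
    """Frame-wise RMS energy of a mono waveform (assumed definition of 'energy')."""
    contour = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        contour.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return contour

def timestamp_mask(events, n_frames, frames_per_second):
    """Turn (label, onset_s, offset_s) events into one 0/1 frame row per label.

    This frame-level encoding is a hypothetical representation of the
    timestamp condition, chosen only for illustration.
    """
    mask = {}
    for label, onset, offset in events:
        row = mask.setdefault(label, [0] * n_frames)
        first = max(0, math.floor(onset * frames_per_second))
        last = min(n_frames, math.ceil(offset * frames_per_second))
        for i in range(first, last):
            row[i] = 1
    return mask

# Toy input: 1 s of silence followed by 1 s of a 440 Hz tone, sampled at 16 kHz.
sr = 16000
wave = [0.0] * sr + [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]

# Non-overlapping 100 ms frames give 10 frames per second.
energy = energy_contour(wave, frame_len=1600, hop=1600)
mask = timestamp_mask([("tone", 1.0, 2.0)], n_frames=len(energy), frames_per_second=10)
```

In this toy case the energy contour is zero during the silent first second and roughly constant during the tone, while the mask row for `"tone"` is active only in the frames covering 1.0 s to 2.0 s. Aligning both conditions to the same frame rate is one simple way to let a downstream encoder consume them together.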