Text-to-audio generation models are limited because text alone cannot capture all the information in audio, which restricts controllability when generation relies solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions, namely content (timestamps) and style (pitch contour and energy contour), as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of the generated audio. To preserve generation diversity, we keep the weights of the pre-trained text-to-audio model frozen and train only two added components: a control condition encoder, enhanced by a large language model, and a Fusion-Net, which together encode and fuse the additional conditions. Because suitable datasets and evaluation metrics are lacking, we consolidate existing datasets into a new dataset comprising audio and the corresponding conditions, and we evaluate controllability with a series of metrics. Experimental results demonstrate that our model achieves fine-grained control over the generated audio. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
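To make the additional conditions concrete, the following is a minimal sketch of how a timestamp condition and an energy contour could be represented as frame-level signals. It is an illustration under stated assumptions, not the method used in this work: the helper names `energy_contour` and `timestamp_mask`, the frame length, the hop size, and the frame rate are all hypothetical, and frame-wise RMS is only one common way to define energy.

```python
import math


def energy_contour(wave, frame_len=4, hop=2):
    """Frame-wise RMS energy of a mono waveform (hypothetical helper).

    Assumption: energy is taken as the root mean square of each frame;
    the abstract does not state how the energy contour is computed.
    """
    n_frames = max(0, 1 + (len(wave) - frame_len) // hop)
    contour = []
    for i in range(n_frames):
        frame = wave[i * hop : i * hop + frame_len]
        contour.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return contour


def timestamp_mask(events, classes, duration, frame_rate):
    """Binary class-by-frame matrix built from (label, start_s, end_s) events.

    Assumption: timestamps are rasterized onto a fixed frame grid; this
    layout is illustrative only.
    """
    n_frames = int(round(duration * frame_rate))
    mask = [[0.0] * n_frames for _ in classes]
    for label, start, end in events:
        row = classes.index(label)
        first = max(0, int(round(start * frame_rate)))
        last = min(n_frames, int(round(end * frame_rate)))
        for t in range(first, last):
            mask[row][t] = 1.0
    return mask


if __name__ == "__main__":
    # Toy signal: loud for the first half, silent for the second half.
    print(energy_contour([1, 1, 1, 1, 0, 0, 0, 0]))
    # Two overlapping events on a 2-second clip at 2 frames per second.
    print(timestamp_mask([("dog", 0.0, 1.0), ("bell", 0.5, 2.0)],
                         ["dog", "bell"], duration=2.0, frame_rate=2))
```

Frame-level representations like these share a time axis with the generated audio, which is one plausible reason such conditions can be encoded and fused alongside the text.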