Text-based audio generation models are limited because text alone cannot capture all the information in audio, which restricts controllability when generation relies solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions, namely content (timestamp) and style (pitch contour and energy contour), as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of the generated audio. To preserve the diversity of generation, we keep the weights of the pre-trained text-to-audio model frozen and train only a control condition encoder, enhanced by a large language model, and a Fusion-Net, which together encode and fuse the additional conditions. Because suitable datasets and evaluation metrics are lacking, we consolidate existing datasets into a new dataset comprising audio and the corresponding conditions, and we use a series of evaluation metrics to assess controllability. Experimental results demonstrate that our model achieves fine-grained, controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
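To make the additional conditions concrete, the following is a minimal sketch of how two of them, an energy contour and a frame-level timestamp condition, might be derived. The abstract does not specify the extraction procedure, so the function names (`energy_contour`, `timestamp_roll`), the frame length, the hop size, and the frame rate are all illustrative assumptions rather than the method used in the paper.

```python
import numpy as np


def energy_contour(wave, frame_len=1024, hop=256):
    """Frame-wise RMS energy of a mono waveform (assumed definition of 'energy contour')."""
    wave = np.asarray(wave, dtype=np.float64)
    if len(wave) < frame_len:
        # Zero-pad short clips so that at least one full frame exists.
        wave = np.pad(wave, (0, frame_len - len(wave)))
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt(np.mean(frames ** 2, axis=1))


def timestamp_roll(events, classes, n_frames, frames_per_second):
    """Turn (label, start_s, end_s) events into a binary (n_classes, n_frames) matrix.

    This frame-level encoding is a hypothetical representation of the
    timestamp condition, chosen here only for illustration.
    """
    roll = np.zeros((len(classes), n_frames), dtype=np.float32)
    for label, start, end in events:
        a = max(int(round(start * frames_per_second)), 0)
        b = min(int(round(end * frames_per_second)), n_frames)
        roll[classes.index(label), a:b] = 1.0
    return roll


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    sine = np.sin(2 * np.pi * 440.0 * t)  # one second of a unit-amplitude tone
    print(energy_contour(sine).shape)
    print(timestamp_roll([("dog", 0.5, 1.0)], ["dog", "cat"], 10, 4))
```

In a setup like the one described, contours of this kind would be passed to the trainable condition encoder while the pre-trained text-to-audio model stays frozen; how the paper actually encodes and fuses them is not stated in the abstract.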