We study the fine-grained text-to-audio (T2A) generation task. While recent models can synthesize high-quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for each specific control type, we propose training ControlNet models on top of pre-trained T2A backbones to achieve controllable generation over loudness, pitch, and event roll. We introduce two designs, T2A-ControlNet and T2A-Adapter, and show that T2A-Adapter offers a more efficient structure with strong controllability. With only 38M additional parameters, T2A-Adapter achieves state-of-the-art performance on the AudioSet-Strong benchmark in both event-level and segment-level F1 scores. We further extend this framework to audio editing, proposing T2A-Editor for removing and inserting audio events at time locations specified by instructions. Models, code, dataset pipelines, and benchmarks will be released to support future research on controllable audio generation and editing.
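The abstract does not specify how the adapter injects control signals into the frozen backbone. Below is a minimal sketch of the general ControlNet-style idea it alludes to: a small, zero-initialized adapter maps a time-aligned control signal (e.g., a framewise loudness curve) to residual features added to the backbone's hidden states, so that only the adapter's parameters are trained. All class names, dimensions, and variable names here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ControlAdapter(nn.Module):
    """Hypothetical lightweight adapter: maps a time-aligned control
    signal (e.g., loudness or pitch curve) to residual features that
    are added to a frozen T2A backbone's hidden states."""

    def __init__(self, control_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(control_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Zero-init the final layer so training starts from the
        # unmodified backbone output, as in ControlNet-style designs.
        nn.init.zeros_(self.proj[-1].weight)
        nn.init.zeros_(self.proj[-1].bias)

    def forward(self, hidden: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # hidden:  (batch, time, hidden_dim)  features from the frozen backbone
        # control: (batch, time, control_dim) framewise control signal
        return hidden + self.proj(control)

# Usage sketch: the backbone stays frozen; only adapter params train.
backbone_hidden = torch.randn(2, 100, 512)   # stand-in for frozen T2A features
loudness_curve = torch.randn(2, 100, 1)      # stand-in framewise loudness
adapter = ControlAdapter(control_dim=1, hidden_dim=512)
out = adapter(backbone_hidden, loudness_curve)
print(out.shape)  # torch.Size([2, 100, 512])
```

The zero-initialization is the key design choice in this family of methods: at the start of training the adapter contributes nothing, so the pre-trained backbone's generation quality is preserved while control is learned gradually with a small parameter budget (38M in the paper's T2A-Adapter).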