Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and the temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer integrates these nodes and produces contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, yielding a curated, semantically diverse dataset. Furthermore, we present consensus preference optimization, which facilitates audio generation through consensus among multiple reward signals. Extensive experiments on the AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performance across a variety of objective and subjective evaluation metrics.
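To make the three node aspects concrete (semantic features, temporal attributes, and inter-event connections), the following is a minimal illustrative sketch, not the DegDiT implementation. It assumes each event is given as a text label with onset and offset times in seconds, and it stands in for the learned components with plain data: the label plays the role of the semantic feature, the timestamps are the temporal attributes, and a pairwise relation tag approximates an inter-event connection. The names `EventNode`, `relation`, and `build_event_graph`, as well as the relation labels, are hypothetical. The graph transformer and diffusion model are not reproduced here.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class EventNode:
    """One sound event (hypothetical structure, for illustration only)."""
    label: str    # open-vocabulary description; stands in for a semantic feature
    onset: float  # start time in seconds (temporal attribute)
    offset: float # end time in seconds (temporal attribute)


def relation(a: EventNode, b: EventNode) -> str:
    """Classify the temporal relation of event a with respect to event b."""
    if a.offset <= b.onset:
        return "before"
    if b.offset <= a.onset:
        return "after"
    return "overlaps"


def build_event_graph(events: list[EventNode]) -> dict[tuple[int, int], str]:
    """Return edges keyed by node-index pairs (i, j) with i < j.

    Each edge stores a coarse temporal relation, used here as a simple
    proxy for an inter-event connection.
    """
    for event in events:
        if event.onset >= event.offset:
            raise ValueError(f"onset must precede offset for {event.label!r}")
    return {
        (i, j): relation(events[i], events[j])
        for i, j in combinations(range(len(events)), 2)
    }


events = [
    EventNode("dog barking", 0.0, 2.0),
    EventNode("car horn", 1.5, 3.0),
    EventNode("door slamming", 4.0, 5.0),
]
edges = build_event_graph(events)
```

In this toy example the first two events overlap in time, while both precede the third. A learned model would replace the string labels and relation tags with embeddings, but the same node and edge layout could be passed to a graph encoder.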