Although Text-to-Audio (TTA) generation models have advanced significantly, producing high-fidelity audio with fine-grained context understanding, they still struggle to model the relations between audio events described in the input text. Previous TTA methods have neither systematically explored audio event relation modeling nor proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: (1) proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; (2) introducing a new audio event corpus encompassing commonly heard audio events; and (3) proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models' ability to model audio event relations. Code is available at: https://github.com/yuhanghe01/RiTTA