Despite recent progress in text-to-audio (TTA) generation, we show that state-of-the-art models such as AudioLDM, when trained on datasets with an imbalanced class distribution such as AudioCaps, are biased in their generation performance. Specifically, they excel at generating common audio classes while underperforming on rare ones, which degrades overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Given an input text prompt, we first leverage a Contrastive Language-Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved text-audio pairs are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Fréchet Audio Distance (FAD) of 1.37, outperforming existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.
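The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`retrieve_top_k`, `build_retrieval_condition`), the embedding dimension, and the random vectors standing in for CLAP text and audio embeddings are all assumptions made for illustration, and stacking the retrieved features is only one plausible way to form the additional conditions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(query_emb, db_text_embs, k=3):
    """Return indices and cosine similarities of the k captions closest to the query.

    query_emb:    (dim,) embedding of the input prompt (assumed to come from a CLAP text encoder)
    db_text_embs: (n, dim) caption embeddings of the retrieval database
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    db = l2_normalize(np.asarray(db_text_embs, dtype=np.float64))
    sims = db @ q                       # cosine similarity to every caption
    k = min(k, sims.shape[0])
    top = np.argsort(-sims)[:k]         # indices sorted by descending similarity
    return top, sims[top]


def build_retrieval_condition(indices, db_text_embs, db_audio_embs):
    """Stack text and audio features of the retrieved pairs into shape (k, 2, dim).

    How the features are fused into the generator is an assumption here.
    """
    return np.stack([db_text_embs[indices], db_audio_embs[indices]], axis=1)


# Hypothetical database: random vectors stand in for real CLAP embeddings.
rng = np.random.default_rng(0)
db_text = rng.normal(size=(100, 16))
db_audio = rng.normal(size=(100, 16))

# A prompt whose embedding lies very close to caption 42.
query = db_text[42] + 0.01 * rng.normal(size=16)

idx, scores = retrieve_top_k(query, db_text, k=3)
condition = build_retrieval_condition(idx, db_text, db_audio)
print(idx, condition.shape)
```

In a real system the condition tensor would be passed to the TTA model alongside the prompt embedding, so that prompts for rare classes can borrow information from the retrieved examples.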