Current leading Text-To-Audio (TTA) generation models suffer from degraded performance in zero-shot and few-shot settings. It is often challenging to generate high-quality audio for audio events that are unseen or uncommon in the training set. Inspired by the success of Retrieval-Augmented Generation (RAG) in Large Language Model (LLM)-based knowledge-intensive tasks, we extend the TTA process with additional conditioning contexts. We propose Audiobox TTA-RAG, a novel retrieval-augmented TTA approach based on Audiobox, a conditional flow-matching audio generation model. Unlike the vanilla Audiobox TTA solution, which generates audio conditioned on text alone, we augment the conditioning input with retrieved audio samples that provide additional acoustic information for generating the target audio. Our retrieval method does not require labeled audio in the external database, enabling more practical use cases. To evaluate the proposed method, we curated test sets in zero-shot and few-shot settings. Our empirical results show that the proposed model can effectively leverage the retrieved audio samples, significantly improving zero-shot and few-shot TTA performance by large margins on multiple evaluation metrics, while maintaining the ability to generate semantically aligned audio in the in-domain setting. In addition, we investigate the effect of different retrieval methods and data sources.