We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio clip and on captions similar to that audio retrieved from a datastore. The method can also transfer to any domain without additional fine-tuning. To generate a caption for an audio sample, we use the audio-text model CLAP to retrieve similar captions from a replaceable datastore and construct a prompt from them. We then feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition caption generation on the audio. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Moreover, because it can exploit a large datastore of text captions alone in a training-free fashion, RECAP can caption novel audio events never seen during training as well as compositional audio containing multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
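The retrieval and prompt-construction step described in the abstract can be illustrated with a minimal sketch. It assumes that audio and caption embeddings already live in a shared space (as CLAP would provide) and stands in for them with small hand-written vectors. The function names, the toy datastore, and the prompt template are hypothetical and are not taken from the RECAP implementation. The cross-attention conditioning of GPT-2 on the audio is not shown.

```python
import math
from typing import List, Sequence, Tuple

# Each datastore entry pairs a caption with a precomputed text embedding.
# Toy 3-d vectors stand in for real CLAP embeddings (an assumption).
Entry = Tuple[str, Sequence[float]]

DATASTORE: List[Entry] = [
    ("a dog barks in the distance", [0.9, 0.1, 0.0]),
    ("rain falls on a tin roof", [0.0, 0.2, 0.9]),
    ("a dog growls while a car passes", [0.7, 0.6, 0.1]),
]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_captions(audio_emb: Sequence[float],
                      datastore: List[Entry],
                      k: int = 2) -> List[str]:
    """Return the k captions whose embeddings are closest to the audio embedding."""
    ranked = sorted(datastore,
                    key=lambda entry: cosine(audio_emb, entry[1]),
                    reverse=True)
    return [caption for caption, _ in ranked[:k]]


def build_prompt(captions: List[str]) -> str:
    """Join retrieved captions into a text prompt (hypothetical template)."""
    context = " ".join(f"{c}." for c in captions)
    return f"Similar audio clips sound like: {context}\nThis audio sounds like:"


if __name__ == "__main__":
    query_emb = [1.0, 0.2, 0.0]  # stand-in for the embedding of the input audio
    print(build_prompt(retrieve_captions(query_emb, DATASTORE, k=2)))
```

Because the datastore is only a list of caption and embedding pairs, swapping in captions from a new domain changes what is retrieved without touching any model weights, which is the sense in which the datastore is replaceable and training-free.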