We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on the input audio and on captions similar to that audio retrieved from a datastore. Our method can also transfer to any domain without additional fine-tuning. To generate a caption for an audio sample, we use the audio-text model CLAP to retrieve similar captions from a replaceable datastore and use them to construct a prompt. We then feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 so that caption generation is conditioned on the audio. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Moreover, because it can exploit a large datastore of text captions alone in a training-free fashion, RECAP can caption novel audio events never seen during training as well as compositional audio containing multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
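The retrieval and prompt-construction step described in the abstract can be illustrated with a minimal sketch. It assumes audio and caption embeddings already live in a shared space (as a CLAP-style model would provide); the toy vectors, the function names `retrieve_captions` and `build_prompt`, and the prompt template are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def retrieve_captions(audio_emb, datastore, k=2):
    """Return the k datastore captions most similar to the audio embedding.

    `datastore` is a list of (caption, embedding) pairs. Swapping in a
    different list changes the domain without retraining anything.
    """
    ranked = sorted(datastore, key=lambda item: cosine(audio_emb, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]


def build_prompt(retrieved):
    """Join retrieved captions into a text prompt (hypothetical template)."""
    context = " ".join(f"{caption}." for caption in retrieved)
    return f"Similar audio sounds like: {context} This audio sounds like:"


# Toy datastore with made-up 3-dimensional embeddings (assumption for illustration).
datastore = [
    ("a dog barks", [1.0, 0.0, 0.0]),
    ("rain falls on a roof", [0.0, 1.0, 0.0]),
    ("a car engine starts", [0.0, 0.0, 1.0]),
]
audio_emb = [0.9, 0.1, 0.4]  # stand-in for an audio encoder output

retrieved = retrieve_captions(audio_emb, datastore, k=2)
prompt = build_prompt(retrieved)
print(prompt)
```

In a full system the prompt would be passed to the text decoder, with the audio encoder's outputs supplied through cross-attention; that part is omitted here because it depends on model details the abstract does not specify.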