Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased by the limited availability and quality of text-audio data. This paper proposes SynthAC, a framework that leverages recent advances in audio generative models and widely available text corpora to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, a text-to-audio generation model, AudioLDM, is used to generate synthetic audio signals from the captions of an image captioning dataset. SynthAC thus extends well-annotated captions from the text-vision domain to audio captioning, and improves text-audio representation by learning the relations within the synthetic text-audio pairs. Experiments demonstrate that SynthAC benefits audio captioning models by incorporating well-annotated text corpora from the text-vision domain, offering a promising solution to the challenge of data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.
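The data-construction step described in the abstract, pairing each caption from an image captioning dataset with audio generated from that caption, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TextToAudio`, `SyntheticPair`, `placeholder_tta`, and `build_synthetic_pairs` are hypothetical names, and `placeholder_tta` is a stand-in that emits a sine tone so the sketch runs without a model. In practice a text-to-audio model such as AudioLDM would sit behind the same callable.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical interface: any function that maps a caption to a mono waveform.
TextToAudio = Callable[[str], List[float]]


@dataclass
class SyntheticPair:
    """One synthetic training example: a caption and the audio generated from it."""
    caption: str
    waveform: List[float]


def placeholder_tta(caption: str, sample_rate: int = 16000,
                    num_samples: int = 1600) -> List[float]:
    """Stand-in for a real text-to-audio model (assumption, for illustration only).

    Returns a sine tone whose frequency depends on the caption length.
    """
    freq = 220.0 + 10.0 * len(caption)
    return [math.sin(2.0 * math.pi * freq * t / sample_rate)
            for t in range(num_samples)]


def build_synthetic_pairs(captions: Iterable[str],
                          tta: TextToAudio) -> List[SyntheticPair]:
    """Generate one synthetic audio clip per non-empty caption."""
    pairs: List[SyntheticPair] = []
    for caption in captions:
        text = caption.strip()
        if not text:
            continue  # skip blank captions rather than generating audio for them
        pairs.append(SyntheticPair(caption=text, waveform=tta(text)))
    return pairs


if __name__ == "__main__":
    image_captions = ["A dog barks in a park. ", "", "Rain falls on a roof."]
    pairs = build_synthetic_pairs(image_captions, placeholder_tta)
    for pair in pairs:
        print(pair.caption, len(pair.waveform))
```

Keeping the generator behind a plain callable means the pairing logic does not depend on any particular model; the resulting pairs could then be used alongside real text-audio data when training a captioning model, which is how the abstract describes the synthetic data being used.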