Automated Audio Captioning (AAC) is the task of generating natural language descriptions of an audio stream. A typical AAC system requires manually curated training data consisting of audio segments paired with text caption annotations. Creating these audio-caption pairs is costly, which leaves the task with a general scarcity of data. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on embeddings from the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose using either noise injection or a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
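The train/inference swap and the noise-injection variant described above can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the abstract: the noise is zero-mean Gaussian, embeddings are L2-normalized before and after the noise is added, the embedding size is 512, and the standard deviation 0.015 is an arbitrary illustrative value. Random vectors stand in for the outputs of the CLAP text and audio encoders, and the helper names (`l2_normalize`, `inject_noise`, `decoder_condition`) are hypothetical. The caption decoder itself and the learnable-adapter variant are not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (assumed here, not stated in the abstract)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def inject_noise(text_emb, noise_std, rng):
    """Perturb text embeddings with zero-mean Gaussian noise, then renormalize.

    The Gaussian form and the renormalization step are assumptions of this sketch.
    """
    noise = rng.normal(0.0, noise_std, size=text_emb.shape)
    return l2_normalize(text_emb + noise)


def decoder_condition(embedding, training, noise_std, rng):
    """Return the vector the caption decoder would be conditioned on.

    Training: a (noisy) text embedding. Inference: an audio embedding, unchanged.
    """
    embedding = l2_normalize(embedding)
    if training:
        return inject_noise(embedding, noise_std, rng)
    return embedding


rng = np.random.default_rng(0)
dim = 512  # assumed embedding size, for illustration only

# Random stand-ins for encoder outputs; a real system would call the
# pretrained text and audio encoders here.
text_emb = l2_normalize(rng.normal(size=(4, dim)))
audio_emb = l2_normalize(rng.normal(size=(4, dim)))

train_prefix = decoder_condition(text_emb, training=True, noise_std=0.015, rng=rng)
infer_prefix = decoder_condition(audio_emb, training=False, noise_std=0.015, rng=rng)
```

The intuition behind this design is that a decoder trained on slightly perturbed text embeddings has seen a neighborhood around each caption embedding, so an audio embedding that lands near, but not exactly on, the matching text embedding is less likely to be out of distribution at inference time.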