Audio captioning is similar in essence to image and video captioning, yet it has received much less attention. We propose three desiderata for captioning audio: (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the related (iii) audibility, the quality of being perceivable from audio alone. Our method is zero-shot: we do not train a model to perform captioning. Instead, captioning is carried out as an inference process involving three networks, one for each desired quality: (i) a large language model, in our case GPT-2, chosen for convenience; (ii) a model that scores how well an audio file matches a text, for which we use the multimodal matching network ImageBind; and (iii) a text classifier trained on a dataset we collected automatically by prompting GPT-4 to generate both audible and inaudible sentences. We present results on the AudioCaps dataset, demonstrating that audibility guidance significantly improves performance over a baseline that lacks this objective.
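To make the three-way guidance concrete, the following is a minimal sketch of how scores for fluency, faithfulness, and audibility might be combined to choose among candidate captions. The abstract does not say how the three networks interact during inference, so the weighted-sum reranking here is an assumption, not the authors' procedure. The names `combined_score` and `pick_caption`, the weights, and the toy scorers are all hypothetical; in a real system they would be replaced by GPT-2 log-likelihoods, ImageBind audio-text similarity, and the audibility classifier's output.

```python
from typing import Callable, Sequence, Tuple

# A scorer maps a candidate caption to a real-valued score (higher is better).
Scorer = Callable[[str], float]


def combined_score(
    caption: str,
    fluency: Scorer,
    faithfulness: Scorer,
    audibility: Scorer,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three objectives (an assumed combination rule)."""
    w_flu, w_faith, w_aud = weights
    return (
        w_flu * fluency(caption)
        + w_faith * faithfulness(caption)
        + w_aud * audibility(caption)
    )


def pick_caption(
    candidates: Sequence[str],
    fluency: Scorer,
    faithfulness: Scorer,
    audibility: Scorer,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> str:
    """Return the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda c: combined_score(c, fluency, faithfulness, audibility, weights),
    )


# --- Toy stand-ins for the three networks (hypothetical, for illustration) ---

def toy_fluency(caption: str) -> float:
    # Stand-in for a language-model log-likelihood: shorter is "more fluent".
    return -0.1 * len(caption.split())


AUDIO_TAGS = {"dog"}  # pretend the audio clip was matched to this concept


def toy_faithfulness(caption: str) -> float:
    # Stand-in for an audio-text matching score: count of matched concepts.
    return float(len(AUDIO_TAGS & set(caption.lower().split())))


AUDIBLE_WORDS = {"barking", "ringing", "honking"}


def toy_audibility(caption: str) -> float:
    # Stand-in for the audibility classifier: 1.0 if the caption mentions a sound.
    return 1.0 if AUDIBLE_WORDS & set(caption.lower().split()) else 0.0


candidates = ["a brown dog", "a dog is barking"]

# With audibility guidance enabled.
best = pick_caption(candidates, toy_fluency, toy_faithfulness, toy_audibility)

# Baseline without the audibility objective (its weight set to zero).
baseline = pick_caption(
    candidates, toy_fluency, toy_faithfulness, toy_audibility, weights=(1.0, 1.0, 0.0)
)
```

In this toy setup the baseline prefers the shorter caption that describes a visual attribute, while adding the audibility term shifts the choice to the caption describing a sound. This mirrors the motivation for the third desideratum, although the actual gains reported on the dataset depend on the real models.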