Audio captioning aims to generate text descriptions of environmental sounds. A key challenge is poor generalization caused by the scarcity of paired audio-text training data. In this work, we propose a simple yet effective method for handling small-scale datasets by leveraging a pre-trained language model. We keep the language model frozen to preserve its expressivity for text generation, and learn only to extract global and temporal features from the input audio. To bridge the modality gap between the audio features and the language model, we employ mapping networks that translate the audio features into continuous vectors the language model can understand, called prefixes. We evaluate the proposed method on the Clotho and AudioCaps datasets and show that it outperforms prior methods in diverse experimental settings.
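The prefix idea described in the abstract can be illustrated with a minimal, framework-free sketch. Everything specific here is an assumption made for illustration, not the method as implemented by the authors: the dimensions `AUDIO_DIM` and `LM_DIM`, the helper names, the use of a single linear projection as a stand-in for each mapping network, average pooling over time for the temporal branch, and the ordering of global prefixes before temporal ones. The frozen language model itself is not included; the sketch only shows how audio features could become prefix vectors that sit in front of the caption token embeddings.

```python
import random

AUDIO_DIM = 8   # hypothetical size of one audio feature vector
LM_DIM = 6      # hypothetical embedding size of the frozen language model


def make_linear(in_dim, out_dim, rng):
    """Create a small trainable linear layer as (weights, bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def apply_linear(layer, x):
    """Compute W x + b for a single input vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def global_prefix(layer, clip_feature, n_prefix):
    """Map one clip-level feature to n_prefix vectors of size LM_DIM."""
    flat = apply_linear(layer, clip_feature)
    return [flat[i * LM_DIM:(i + 1) * LM_DIM] for i in range(n_prefix)]


def temporal_prefix(layer, frame_features, n_prefix):
    """Average-pool frames into n_prefix segments and project each one."""
    n_frames = len(frame_features)
    if n_frames < n_prefix:
        raise ValueError("need at least one frame per prefix vector")
    prefixes = []
    for i in range(n_prefix):
        start = i * n_frames // n_prefix
        end = (i + 1) * n_frames // n_prefix
        segment = frame_features[start:end]
        pooled = [sum(column) / len(segment) for column in zip(*segment)]
        prefixes.append(apply_linear(layer, pooled))
    return prefixes


def build_lm_input(global_vecs, temporal_vecs, caption_embeddings):
    """Concatenate prefixes and caption embeddings along the sequence axis."""
    return global_vecs + temporal_vecs + caption_embeddings


rng = random.Random(0)
global_map = make_linear(AUDIO_DIM, 2 * LM_DIM, rng)   # trained
temporal_map = make_linear(AUDIO_DIM, LM_DIM, rng)     # trained
clip = [rng.random() for _ in range(AUDIO_DIM)]
frames = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(10)]
# Placeholder for token embeddings looked up in the frozen language model.
caption = [[0.0] * LM_DIM for _ in range(5)]

sequence = build_lm_input(
    global_prefix(global_map, clip, 2),
    temporal_prefix(temporal_map, frames, 4),
    caption,
)
print(len(sequence), len(sequence[0]))
```

In this sketch only `global_map` and `temporal_map` would receive gradient updates. The language model that consumes `sequence` stays fixed, which is what lets a small paired dataset suffice for training.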