Automated Audio Captioning (AAC) is the task of generating natural language descriptions of audio content, typically with encoder-decoder architectures: an audio encoder produces audio embeddings that are fed to a decoder, usually a Transformer decoder, which generates the caption. In this work, we describe our model, whose novelty compared to existing models lies in using a ConvNeXt architecture as the audio encoder, adapted from the vision domain to audio classification. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. Because AC is derived from AudioSet, we examine potential biases in the dataset by investigating the impact of an unbiased encoder on performance. Using the well-known PANN's CNN14 as an unbiased encoder, for instance, we observed a 1.7% absolute reduction in SPIDEr score (higher is better). To improve cross-dataset performance, we conducted experiments combining multiple AAC datasets (AC, CL, MACS, WavCaps) for training. Although this strategy improved overall model performance across datasets, it still fell short of models trained specifically on a single target dataset, indicating the absence of a one-size-fits-all model. To reduce the performance gaps between datasets, we introduced a Task Embedding (TE) token, which tells the model the source dataset of each input sample. We provide insights into the impact of these TEs on both the form (words) and the content (sound event types) of the generated captions. The resulting model, named CoNeTTE, an unbiased CNext-trans model enriched with dataset-specific Task Embeddings, achieved SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively. Code is available at https://github.com/Labbeti/conette-audio-captioning.
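As an illustration of the Task Embedding idea, the following minimal sketch shows one way a dataset-specific token could be attached to a caption before it reaches the decoder. It is not taken from the released code: the function names, the `<task_...>` and `<eos>` token strings, the whitespace tokenizer, and the choice to place the task token at the start of the sequence are all assumptions made for this example.

```python
# Hypothetical sketch of a dataset-specific task token.
# All names and token strings below are illustrative assumptions,
# not the API of the CoNeTTE repository.

DATASETS = ("audiocaps", "clotho", "macs", "wavcaps")


def build_vocab(words):
    """Map task tokens, the end token, and caption words to integer ids."""
    specials = [f"<task_{name}>" for name in DATASETS] + ["<eos>"]
    vocab = {}
    for token in specials + sorted(set(words)):
        vocab.setdefault(token, len(vocab))
    return vocab


def encode_with_task_token(caption, dataset, vocab):
    """Prefix the caption with the token of its source dataset.

    Assumption: the task token takes the position of a start-of-sequence
    token, so the decoder sees the dataset identity first.
    """
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    tokens = [f"<task_{dataset}>"] + caption.lower().split() + ["<eos>"]
    return [vocab[token] for token in tokens]


vocab = build_vocab("a dog barks".split())
ids_ac = encode_with_task_token("A dog barks", "audiocaps", vocab)
ids_cl = encode_with_task_token("A dog barks", "clotho", vocab)
# The two sequences differ only in their first id, the task token.
```

Under these assumptions, the same token can be chosen at inference time to request captions in the style of a given dataset, since each task token id would index its own learned embedding vector in the decoder.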