CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content using encoder-decoder architectures. An audio encoder produces audio embeddings that are fed to a decoder, usually a Transformer decoder, for caption generation. In this work, we describe our model, whose novelty compared to existing models lies in the use of a ConvNeXt architecture as the audio encoder, adapted from the vision domain to audio classification. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. Because AC is derived from AudioSet, we examine potential biases in the AC dataset by investigating the impact of an unbiased encoder on performance. Using the well-known PANNs CNN14 as an unbiased encoder, for instance, we observed a 1.7% absolute reduction in SPIDEr score (higher is better). To improve cross-dataset performance, we conducted experiments that combine multiple AAC datasets (AC, CL, MACS, WavCaps) for training. Although this strategy enhanced overall model performance across datasets, it still fell short of models trained specifically on a single target dataset, indicating the absence of a one-size-fits-all model. To mitigate performance gaps between datasets, we introduced a Task Embedding (TE) token, allowing the model to identify the source dataset for each input sample. We provide insights into the impact of these TEs on both the form (words) and content (sound event types) of the generated captions. The resulting model, named CoNeTTE, is an unbiased CNext-trans model enriched with dataset-specific Task Embeddings; it achieved SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively. Code available: https://github.com/Labbeti/conette-audio-captioning.
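The Task Embedding idea can be illustrated with a minimal sketch. It assumes the TE is realized as one extra vocabulary token per training dataset, placed at the start of the decoder input; the abstract only states that the token lets the model identify the source dataset. The token strings, the placement, and the helper names `build_decoder_input` and `build_vocab` are hypothetical and are not taken from the authors' code.

```python
# Hypothetical task tokens, one per training dataset named in the abstract.
TASK_TOKENS = {
    "AC": "<ac>",
    "CL": "<cl>",
    "MACS": "<macs>",
    "WavCaps": "<wavcaps>",
}
EOS = "<eos>"  # assumed end-of-sequence marker


def build_decoder_input(caption_words, dataset):
    """Prefix a caption with the task token of its source dataset.

    Assumption: the task token acts as the start-of-sequence token.
    """
    if dataset not in TASK_TOKENS:
        raise KeyError(f"unknown dataset: {dataset!r}")
    return [TASK_TOKENS[dataset], *caption_words, EOS]


def build_vocab(sequences):
    """Map every distinct token to an integer id, in first-seen order."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


if __name__ == "__main__":
    words = ["a", "dog", "barks"]
    seqs = [build_decoder_input(words, name) for name in ("AC", "CL")]
    vocab = build_vocab(seqs)
    for seq in seqs:
        # Same caption words, different leading id depending on the dataset.
        print(seq, [vocab[tok] for tok in seq])
```

Under this assumption, the same task token would be chosen at inference time to request captions in the style of a given dataset; a trained decoder would then learn an embedding for each task token like for any other vocabulary entry.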

