Automated audio captioning (AAC) is a multi-modal translation task that aims to generate textual descriptions for a given audio clip. In this paper, we propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting. The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model, which is fine-tuned to maximize the semantic similarity between AudioSet labels and ground truth captions. To mitigate the data scarcity problem of AAC, we introduce transfer learning from an upstream audio-related task and an enlarged in-domain dataset. Moreover, we propose a method to apply Mixup augmentation for AAC. Ablation studies are carried out to investigate how Patchout and text guidance contribute to the final performance. The results show that the proposed techniques improve the performance of our system while reducing the computational complexity. Our proposed method received the Judges' Award in Task 6A of the DCASE 2022 Challenge.
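To illustrate why Patchout lowers the computational cost, the following is a minimal sketch of unstructured Patchout on a sequence of spectrogram patch embeddings. It is not the implementation used in this work: the function name `patchout`, the parameters `keep_ratio` and `rng`, and the list-based representation are all hypothetical, and we assume that position information has already been added to each patch before any are dropped. Since self-attention cost grows roughly quadratically with sequence length, keeping half of the patches reduces the pairwise attention computations to about a quarter.

```python
import random


def patchout(patches, keep_ratio, rng):
    """Randomly keep a subset of patch embeddings, preserving their order.

    Hypothetical sketch, intended for training time only.
    patches:    list of patch embeddings (one per time-frequency patch),
                assumed to already carry positional information.
    keep_ratio: fraction of patches to keep, in (0, 1].
    rng:        a random.Random instance, so the sketch is reproducible.
    Returns the kept patches and their original indices.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not patches:
        return [], []
    # Keep at least one patch so the encoder never sees an empty sequence.
    n_keep = max(1, round(len(patches) * keep_ratio))
    # Sample without replacement, then sort to preserve the original order.
    kept_idx = sorted(rng.sample(range(len(patches)), n_keep))
    return [patches[i] for i in kept_idx], kept_idx


if __name__ == "__main__":
    rng = random.Random(0)
    # Ten toy one-dimensional "embeddings" standing in for real patches.
    patches = [[float(i)] for i in range(10)]
    kept, idx = patchout(patches, keep_ratio=0.5, rng=rng)
    print(len(patches), "->", len(kept), "patches; kept indices:", idx)
```

In this sketch the shortened sequence would be passed to the Transformer encoder during training, while the full sequence would be used at inference.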