Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging. Recent work has shown that, despite stacking multiple layers, the receptive field of CNNs remains severely limited. Transformers, on the other hand, are able to map global context through self-attention, but treat the spectrogram as a sequence of patches, which is not flexible enough to capture irregular audio objects. In this work, we treat the spectrogram in a more flexible way by considering it as a graph structure and process it with a novel graph neural architecture called ATGNN. ATGNN not only combines the capability of CNNs with the global information sharing ability of Graph Neural Networks, but also maps semantic relationships between learnable class embeddings and corresponding spectrogram regions. We evaluate ATGNN on two audio tagging tasks, where it achieves 0.585 mAP on the FSD50K dataset and 0.335 mAP on the AudioSet-balanced dataset. These results are comparable to those of Transformer-based models, with a significantly lower number of learnable parameters.
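To make the idea of treating a spectrogram as a graph concrete, the following is a minimal sketch and not the ATGNN implementation. It assumes non-overlapping square patches as nodes, raw patch values as node features, a k-nearest-neighbour graph in feature space, and max-relative aggregation as the message-passing step. All of these choices, along with the function names `patchify`, `knn_graph`, and `max_relative_aggregate`, are illustrative assumptions; the abstract does not specify how ATGNN builds or processes its graph.

```python
import numpy as np


def patchify(spec, patch):
    """Split a (freq, time) spectrogram into non-overlapping patch x patch nodes.

    Returns an array of shape (num_nodes, patch * patch). Rows and columns
    that do not fill a whole patch are dropped.
    """
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch
    blocks = spec[:f, :t].reshape(f // patch, patch, t // patch, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, patch * patch)


def knn_graph(x, k):
    """Return an (n, k) index array of each node's k nearest neighbours."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)  # squared distances
    np.fill_diagonal(d2, np.inf)  # a node is never its own neighbour
    return np.argsort(d2, axis=1)[:, :k]


def max_relative_aggregate(x, nbrs):
    """Concatenate each node with the element-wise max of (neighbour - node)."""
    rel = x[nbrs] - x[:, None, :]  # shape (n, k, d)
    return np.concatenate([x, rel.max(axis=1)], axis=1)


# Random data standing in for a log-mel spectrogram (assumption).
rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 96))

nodes = patchify(spec, patch=16)          # 4 x 6 = 24 nodes, 256 features each
nbrs = knn_graph(nodes, k=5)              # neighbours chosen by feature similarity
out = max_relative_aggregate(nodes, nbrs)  # 24 nodes, 512 features each
```

Because neighbours are picked by feature similarity rather than grid position, a node can exchange information with distant time-frequency regions in a single step, which is the kind of flexibility the abstract contrasts with fixed patch sequences.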