State-of-the-art audio captioning methods typically use an encoder-decoder structure with pretrained audio neural networks (PANNs) as encoders for feature extraction. However, the convolution operation used in PANNs is limited in its ability to capture long-time dependencies within an audio signal, which can degrade audio captioning performance. This letter presents a novel method using graph attention (GraphAC) for encoder-decoder based audio captioning. In the encoder, a graph attention module is introduced after the PANNs to learn contextual association (i.e., the dependency among the audio features over different time frames) through an adjacency graph, and a top-k mask is used to mitigate the interference from noisy nodes. The learned contextual association yields a more effective feature representation through feature node aggregation. As a result, the decoder can predict important semantic information about the acoustic scene and events based on the contextual associations learned from the audio signal. Experimental results show that GraphAC outperforms state-of-the-art methods that use PANNs as encoders, thanks to the incorporation of the graph attention module into the encoder for capturing the long-time dependencies within the audio signal. The source code is available at https://github.com/LittleFlyingSheep/GraphAC.
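The graph attention step described in the abstract can be illustrated with a minimal sketch. It is not the GraphAC implementation: the function name `top_k_graph_attention`, the use of scaled dot-product similarity as the attention score, and the toy input are all assumptions made for illustration, and the learned projections a trained module would have are omitted. The sketch only shows the three ideas named in the abstract: an adjacency graph over time frames, a top-k mask that drops weakly related nodes, and feature node aggregation.

```python
import math


def top_k_graph_attention(frames, k):
    """Aggregate frame features over a top-k masked attention graph.

    frames: list of T feature vectors (each a list of D floats), standing in
            for the per-frame output of a pretrained audio encoder.
    k: number of neighbours each frame (node) keeps.
    Returns (adjacency, aggregated), both as lists of lists.
    """
    T = len(frames)
    D = len(frames[0])
    if not 1 <= k <= T:
        raise ValueError("k must be between 1 and the number of frames")

    adjacency = []
    for i in range(T):
        # Assumption: scaled dot-product similarity between frame i and every
        # frame j serves as the attention score (no learned weights here).
        scores = [
            sum(a * b for a, b in zip(frames[i], frames[j])) / math.sqrt(D)
            for j in range(T)
        ]
        # Top-k mask: keep the k highest-scoring neighbours, drop the rest.
        keep = set(sorted(range(T), key=lambda j: scores[j], reverse=True)[:k])
        # Softmax over the kept neighbours only; masked entries get weight 0.
        peak = max(scores[j] for j in keep)
        exps = [math.exp(scores[j] - peak) if j in keep else 0.0 for j in range(T)]
        total = sum(exps)
        adjacency.append([e / total for e in exps])

    # Node aggregation: each frame becomes a weighted sum of its neighbours.
    aggregated = [
        [sum(adjacency[i][j] * frames[j][d] for j in range(T)) for d in range(D)]
        for i in range(T)
    ]
    return adjacency, aggregated


# Toy input (hypothetical): four frames with two-dimensional features.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
adjacency, aggregated = top_k_graph_attention(frames, k=2)
```

Because every frame is scored against every other frame, a node can attend to a distant time frame as easily as to an adjacent one, which is the property the abstract contrasts with the local receptive field of convolution. The top-k mask then limits each node to its strongest connections, so unrelated or noisy frames contribute nothing to the aggregated feature.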