Automated audio captioning (AAC) generates textual descriptions of audio content. Existing AAC models achieve good results but use only the high-dimensional representation produced by the encoder. Because high-dimensional representations carry a large amount of information, models that rely on them alone tend to learn that information insufficiently. In this paper, a new encoder-decoder model called Low- and High-Dimensional Feature Fusion (LHDFF) is proposed. LHDFF uses a new PANNs encoder, called Residual PANNs (RPANNs), to fuse low- and high-dimensional features. Low-dimensional features contain limited information about specific audio scenes, so fusing them with high-dimensional features can improve model performance by repeatedly emphasizing that scene-specific information. To fully exploit the fused features, LHDFF uses a dual transformer decoder structure to generate captions in parallel. Experimental results show that LHDFF outperforms existing audio captioning models.
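The fusion idea can be illustrated with a small sketch. The abstract does not specify how RPANNs combines the two feature levels, so everything below is an assumption made for illustration: the low-dimensional features are average-pooled in time to match the frame rate of the high-dimensional features, linearly projected to the same channel count, and added element-wise in a residual style. The names `avg_pool_time`, `project`, and `residual_fuse` are hypothetical, and the dual transformer decoder is not modeled.

```python
from typing import List

# A feature map is a list of frames; each frame is a list of channel values.
Matrix = List[List[float]]


def avg_pool_time(x: Matrix, factor: int) -> Matrix:
    """Average every `factor` consecutive frames (a trailing remainder is dropped)."""
    pooled = []
    for start in range(0, len(x) - factor + 1, factor):
        window = x[start:start + factor]
        pooled.append([sum(col) / factor for col in zip(*window)])
    return pooled


def project(x: Matrix, weight: Matrix) -> Matrix:
    """Linear map over channels; `weight` has shape (in_channels, out_channels)."""
    return [
        [sum(v * w for v, w in zip(frame, col)) for col in zip(*weight)]
        for frame in x
    ]


def residual_fuse(low: Matrix, high: Matrix, weight: Matrix) -> Matrix:
    """Assumed fusion: align the low-level features to the high-level ones, then add."""
    factor = len(low) // len(high)  # assumes len(low) is a multiple of len(high)
    low_aligned = project(avg_pool_time(low, factor), weight)
    return [
        [l + h for l, h in zip(low_frame, high_frame)]
        for low_frame, high_frame in zip(low_aligned, high)
    ]


# Toy shapes: 4 frames x 2 channels (early layer), 2 frames x 3 channels (late layer).
low = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
high = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # fixed here; learned in a real model

fused = residual_fuse(low, high, weight)
print(fused)  # [[2.5, 3.5, 5.5], [7.0, 8.0, 14.0]]
```

In this sketch the fused output keeps the shape of the high-dimensional features, so a decoder written for the original encoder output could consume it unchanged; whether LHDFF does the same is not stated in the abstract.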