Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds, and accurately recognizing such ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which uses visual information to help describe ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate these visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
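As a rough illustration of what adaptively integrating audio and visual context could look like, the sketch below attends over each modality with a decoder query and mixes the two context vectors with a learned scalar gate. This is a minimal sketch under stated assumptions, not the method proposed here: the abstract does not specify the fusion form, so the scaled dot-product attention, the sigmoid gate, and the names `attend`, `fuse`, `gate_w`, and `gate_b` are all hypothetical. The removal of redundant information in the latent space is not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Scaled dot-product attention of one query over a sequence of vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum of the feature vectors, one output dimension at a time.
    return [sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(query, audio_feats, visual_feats, gate_w, gate_b):
    """Hypothetical gated fusion: returns (fused context, gate value).

    Assumption: a single scalar gate computed from the concatenated audio and
    visual contexts decides how much each modality contributes.
    """
    audio_ctx = attend(query, audio_feats)
    visual_ctx = attend(query, visual_feats)
    joint = audio_ctx + visual_ctx  # list concatenation: [audio_ctx; visual_ctx]
    gate = sigmoid(sum(w * x for w, x in zip(gate_w, joint)) + gate_b)
    fused = [gate * a + (1.0 - gate) * v for a, v in zip(audio_ctx, visual_ctx)]
    return fused, gate


if __name__ == "__main__":
    query = [1.0, 0.0]                   # toy decoder state
    audio = [[1.0, 0.0], [0.0, 1.0]]     # two audio frames
    visual = [[0.5, 0.5], [0.5, 0.5]]    # two video frames
    fused, gate = fuse(query, audio, visual, [0.0] * 4, 0.0)
    print(fused, gate)
```

With zero gate weights and zero bias the gate is 0.5, so both modalities contribute equally. In a trained system the gate parameters would be learned, letting the model lean on visual context when the audio alone is ambiguous.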