Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds, and accurately recognizing such ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which uses visual information to help describe ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate these visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
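The abstract does not give the exact form of the audio-visual attention mechanism, so the following is only a minimal sketch of one plausible reading: a decoder query attends separately over audio and visual features, and a learned sigmoid gate mixes the two context vectors per dimension. The function names (`attend`, `gated_fusion`), the gate parameters (`w_gate`, `b_gate`), and the gating formula itself are assumptions made for illustration, not the authors' formulation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Scaled dot-product attention of one query over a feature sequence.

    query: list of d floats; features: list of T frames, each d floats.
    Returns a context vector of d floats.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, frame)) / scale
              for frame in features]
    weights = softmax(scores)
    return [sum(w * frame[i] for w, frame in zip(weights, features))
            for i in range(len(query))]


def gated_fusion(query, audio_feats, visual_feats, w_gate, b_gate):
    """Mix audio and visual contexts with a per-dimension sigmoid gate.

    Assumed formulation (hypothetical, not from the paper):
        gate  = sigmoid(w_gate @ [c_audio; c_visual] + b_gate)
        fused = gate * c_audio + (1 - gate) * c_visual
    """
    c_audio = attend(query, audio_feats)
    c_visual = attend(query, visual_feats)
    gate_in = c_audio + c_visual  # list concatenation, length 2 * d
    gate = [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, gate_in)) + b)))
            for row, b in zip(w_gate, b_gate)]
    fused = [g * a + (1.0 - g) * v
             for g, a, v in zip(gate, c_audio, c_visual)]
    return fused, gate


# Toy usage with random features standing in for encoder outputs.
rng = random.Random(0)
d = 4
audio_feats = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(6)]
visual_feats = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]
query = [rng.gauss(0, 1) for _ in range(d)]
w_gate = [[0.1 * rng.gauss(0, 1) for _ in range(2 * d)] for _ in range(d)]
b_gate = [0.0] * d

fused, gate = gated_fusion(query, audio_feats, visual_feats, w_gate, b_gate)
print(len(fused), all(0.0 < g < 1.0 for g in gate))
```

In this sketch, a gate value near 1 lets the audio context dominate a dimension and a value near 0 favors the visual context, which is one simple way to model adaptive integration. How the actual system suppresses redundant information in the latent space is not specified in the abstract.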