The Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos. Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which may be unrelated to the given question or may even interfere with answering it. In contrast, focusing only on the question-aware audio-visual content removes this interference and enables the model to answer more efficiently. In this paper, we propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions w.r.t. questions. Specifically, a temporal segment selection module is first introduced to select the audio-visual segments most relevant to the given question. Then, a spatial region selection module is utilized to choose the regions most relevant to the question from the selected temporal segments. To further refine the selection of features, an audio-guided visual attention module is employed to perceive the association between audio and the selected spatial regions. Finally, the spatio-temporal features from these modules are integrated for answering the question. Extensive experimental results on the public MUSIC-AVQA and AVQA datasets provide compelling evidence of the effectiveness and efficiency of PSTP-Net. Code is available at \href{https://github.com/GeWu-Lab/PSTP-Net}{https://github.com/GeWu-Lab/PSTP-Net}.
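
The progressive selection described above can be illustrated with a minimal sketch. This is not the PSTP-Net implementation: it assumes plain dot-product similarity and hard top-k selection in place of the learned modules, scores segments using visual features only, and uses hypothetical names throughout (`select_top_k`, `audio_guided_attention`, `progressive_perception`, `k_t`, `k_s`) together with made-up toy features.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(query, feats, k):
    """Indices of the k features most similar to the query, in original order.

    Assumption: similarity is a raw dot product, not a learned attention score.
    """
    scores = [dot(query, f) for f in feats]
    ranked = sorted(range(len(feats)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


def audio_guided_attention(audio, patches):
    """Weight the selected patches by their similarity to the audio feature."""
    weights = softmax([dot(audio, p) for p in patches])
    dim = len(patches[0])
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]


def progressive_perception(question, seg_visual, seg_audio, seg_patches, k_t, k_s):
    """Temporal selection, then spatial selection, then audio-guided attention."""
    # Step 1: keep the k_t segments most relevant to the question.
    kept_segments = select_top_k(question, seg_visual, k_t)
    fused = []
    for t in kept_segments:
        # Step 2: within each kept segment, keep the k_s most relevant patches.
        kept_patches = [seg_patches[t][i]
                        for i in select_top_k(question, seg_patches[t], k_s)]
        # Step 3: pool the kept patches under audio guidance.
        fused.append(audio_guided_attention(seg_audio[t], kept_patches))
    return kept_segments, fused


# Toy input: 3 segments, 3 patches per segment, 2-dimensional features.
question = [1.0, 0.0]
seg_visual = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]
seg_audio = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
seg_patches = [
    [[0.2, 0.8], [0.1, 0.9], [0.3, 0.7]],
    [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]],
    [[0.4, 0.6], [0.6, 0.4], [0.1, 0.9]],
]

kept_segments, fused = progressive_perception(
    question, seg_visual, seg_audio, seg_patches, k_t=2, k_s=2
)
```

In this sketch, `fused` holds one pooled feature vector per kept segment. A downstream answer classifier, not shown here, would combine these vectors with the question representation.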