The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template, to assist the temporal perception module in better identifying critical segments relevant to the questions. Subsequently, a spatial perception module is designed to merge visual tokens from selected segments to highlight key latent targets, followed by cross-modal interaction with audio to perceive potential sound-aware areas. Finally, the significant temporal-spatial cues from these modules are integrated to answer the question. Extensive experiments on multiple AVQA benchmarks demonstrate that our framework excels not only in understanding audio-visual scenes but also in answering complex questions effectively. Code is available at https://github.com/GeWu-Lab/TSPM.
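The two-stage perception described above can be sketched in code. This is a hypothetical illustration, not the authors' implementation: the embedding dimensions, the cosine-similarity ranking for temporal segment selection, and the audio-correlation scoring used as a stand-in for token merging are all simplifying assumptions.

```python
import numpy as np

def select_segments(prompt_emb, segment_embs, k=2):
    """Temporal perception (sketch): rank video segments by cosine
    similarity to a declarative-prompt embedding and keep the top-k."""
    sims = segment_embs @ prompt_emb / (
        np.linalg.norm(segment_embs, axis=1) * np.linalg.norm(prompt_emb) + 1e-8)
    return np.argsort(sims)[::-1][:k]

def merge_tokens(tokens, audio_emb, keep=4):
    """Spatial perception (sketch): keep the visual tokens most
    correlated with the audio embedding, a stand-in for highlighting
    sound-aware areas via cross-modal interaction."""
    scores = tokens @ audio_emb
    idx = np.argsort(scores)[::-1][:keep]
    return tokens[idx]

rng = np.random.default_rng(0)
prompt = rng.normal(size=64)           # declarative-prompt embedding (assumed dim)
segments = rng.normal(size=(10, 64))   # per-segment visual embeddings
top = select_segments(prompt, segments, k=2)

tokens = rng.normal(size=(16, 64))     # visual tokens from a selected segment
audio = rng.normal(size=64)            # audio embedding for the same segment
kept = merge_tokens(tokens, audio, keep=4)
print(top.shape, kept.shape)           # (2,) (4, 64)
```

The temporal-spatial cues (`top` segments and `kept` tokens) would then be fused with the question representation by an answer head, which this sketch omits.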