Audio-visual question answering (AVQA) is a challenging task that requires multistep spatio-temporal reasoning over multimodal contexts. Recent works rely on elaborate target-agnostic parsing of audio-visual scenes for spatial grounding, while treating audio and video as separate entities for temporal grounding. This paper proposes a new target-aware joint spatio-temporal grounding network for AVQA. It consists of two key components: the target-aware spatial grounding module (TSG) and the single-stream joint audio-visual temporal grounding module (JTG). TSG focuses on the audio-visual cues relevant to the query subject by exploiting explicit semantics from the question. Unlike previous two-stream temporal grounding modules, which require an additional audio-visual fusion module, JTG combines audio-visual fusion and question-aware temporal grounding in one module with a simpler single-stream architecture. Temporal synchronization between audio and video in JTG is encouraged by our proposed cross-modal synchrony loss (CSL). Extensive experiments verify the effectiveness of the proposed method over existing state-of-the-art methods.
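To make the single-stream idea concrete, the following is a minimal NumPy sketch, not the network described above. It assumes per-segment audio and visual features of equal length and dimension, stacks them into one sequence, and lets a question vector attend over that sequence so that fusion and question-aware temporal grounding happen in a single attention step. The function names (`joint_temporal_grounding`, `synchrony_penalty`) are hypothetical, and the use of a Jensen-Shannon divergence between the per-modality attention weights as a synchrony penalty is an assumption made for illustration; the abstract does not specify the form of CSL.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_temporal_grounding(question, audio, visual):
    """Single-stream question-aware attention (illustrative sketch).

    question: (d,) question feature
    audio:    (T, d) per-segment audio features
    visual:   (T, d) per-segment visual features
    Returns the fused feature (d,) and the attention weights of each
    modality, renormalized to sum to 1 over the T segments.
    """
    T, d = audio.shape
    # One stream: both modalities share a single sequence of 2T tokens.
    stream = np.concatenate([audio, visual], axis=0)   # (2T, d)
    scores = stream @ question / np.sqrt(d)            # (2T,)
    weights = softmax(scores)                          # (2T,)
    fused = weights @ stream                           # (d,)
    w_audio = weights[:T] / weights[:T].sum()
    w_visual = weights[T:] / weights[T:].sum()
    return fused, w_audio, w_visual


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def synchrony_penalty(w_audio, w_visual):
    """Assumed stand-in for a synchrony loss: penalize the two modalities
    for attending to different segments."""
    return js_divergence(w_audio, w_visual)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=8)
    a = rng.normal(size=(5, 8))
    v = rng.normal(size=(5, 8))
    fused, w_a, w_v = joint_temporal_grounding(q, a, v)
    print(fused.shape, synchrony_penalty(w_a, w_v))
```

In this sketch the penalty is zero when both modalities place identical attention on the same segments and grows as their temporal attention patterns diverge, which is one way a loss term could push audio and video toward temporal agreement.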