Audio--Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information to answer natural language questions about a given video. Inspired by recent advances in Video QA, many existing AVQA approaches focus primarily on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in these methods, the audio input is treated merely as a complement to the visual analysis, and the textual question contributes little to audio--visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial--Temporal--Frequency (QSTar) interaction method, which effectively incorporates question-guided cues and exploits the distinctive frequency-domain characteristics of audio signals, alongside spatial and temporal perception, to enhance audio--visual understanding. Furthermore, we introduce a Query Context Reasoning (QCR) block, inspired by prompting, which guides the model to focus more precisely on semantically relevant audio and visual features. Extensive experiments on several AVQA benchmarks demonstrate the effectiveness of the proposed method, which achieves significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches. The code and pretrained models will be released upon publication.
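To make the abstract's central idea concrete, the following is a minimal PyTorch sketch of one plausible realization of question-guided attention over frequency-domain audio features: a pooled question embedding acts as the attention query against the magnitude spectrum of frame-level audio features. This is an illustration under stated assumptions only; all module and variable names are hypothetical, and it is not the paper's actual QSTar implementation.

```python
import torch
import torch.nn as nn


class QueryGuidedFrequencyAttention(nn.Module):
    """Question-guided attention over frequency-domain audio features.

    Illustrative sketch only: names are hypothetical and this is not
    the paper's actual QSTar implementation.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # question embedding -> attention query
        self.k_proj = nn.Linear(dim, dim)  # frequency features -> keys
        self.v_proj = nn.Linear(dim, dim)  # frequency features -> values

    def forward(self, question: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # question: (B, dim) pooled question embedding
        # audio:    (B, T, dim) per-frame audio features
        # Move to the frequency domain along the time axis with a real FFT
        # and keep the magnitude spectrum as the frequency representation.
        freq = torch.fft.rfft(audio, dim=1).abs()            # (B, F, dim)
        q = self.q_proj(question).unsqueeze(1)               # (B, 1, dim)
        k, v = self.k_proj(freq), self.v_proj(freq)          # (B, F, dim)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5  # (B, 1, F)
        attn = torch.softmax(scores, dim=-1)
        return (attn @ v).squeeze(1)                         # (B, dim) question-aware audio summary


# Example usage with random inputs:
if __name__ == "__main__":
    module = QueryGuidedFrequencyAttention(dim=256)
    question = torch.randn(2, 256)   # batch of 2 question embeddings
    audio = torch.randn(2, 64, 256)  # 64 audio frames per clip
    print(module(question, audio).shape)  # torch.Size([2, 256])
```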