Audio-visual question answering (AVQA), the task of answering questions about audio-visual scenes, is attracting growing interest. A critical challenge is accurately identifying and tracking the sounding objects related to the question along the timeline. In this paper, we present a new Patch-level Sounding Object Tracking (PSOT) method. It begins with a Motion-driven Key Patch Tracking (M-KPT) module, which relies on visual motion information to identify salient visual patches with significant movement, since these are more likely to relate to sounding objects and to the question. We measure the patch-wise motion intensity map between neighboring video frames and use it to construct and guide a motion-driven graph network. Meanwhile, we design a Sound-driven KPT (S-KPT) module to explicitly track sounding patches. This module also involves a graph network, whose adjacency matrix is regularized by the audio-visual correspondence map. The M-KPT and S-KPT modules run in parallel for each temporal segment, allowing balanced tracking of salient and sounding objects. Based on the tracked patches, we further propose a Question-driven KPT (Q-KPT) module that retains the patches most relevant to the question, ensuring the model focuses on the most informative clues. The audio, visual, and question features are updated as they pass through these modules and are then aggregated for final answer prediction. Extensive experiments on standard datasets demonstrate the effectiveness of our method, which achieves competitive performance even compared with recent large-scale pretraining-based approaches.
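As a rough illustration of the patch-wise motion intensity map that drives M-KPT, the sketch below scores each non-overlapping patch by its mean absolute pixel difference between two neighboring frames and keeps the highest-scoring patches. The abstract does not say how motion intensity is computed, so frame differencing is an assumption made here for illustration, and the names `patch_motion_intensity` and `top_k_patches` are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale frame stored as H x W nested lists


def patch_motion_intensity(prev: Frame, curr: Frame, patch: int) -> List[List[float]]:
    """Mean absolute pixel difference per non-overlapping patch.

    Assumption: frame differencing stands in for the paper's motion measure.
    Rows and columns that do not fill a whole patch are ignored.
    """
    height, width = len(curr), len(curr[0])
    rows, cols = height // patch, width // patch
    intensity = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for y in range(r * patch, (r + 1) * patch):
                for x in range(c * patch, (c + 1) * patch):
                    total += abs(curr[y][x] - prev[y][x])
            intensity[r][c] = total / (patch * patch)
    return intensity


def top_k_patches(intensity: List[List[float]], k: int) -> List[Tuple[int, int]]:
    """Return (row, col) indices of the k patches with the highest intensity."""
    scored = [
        (value, r, c)
        for r, row in enumerate(intensity)
        for c, value in enumerate(row)
    ]
    # Stable sort: ties keep their row-major order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(r, c) for _, r, c in scored[:k]]


if __name__ == "__main__":
    # Two 4x4 frames; only the bottom-right 2x2 region changes.
    prev_frame = [[0.0] * 4 for _ in range(4)]
    curr_frame = [[0.0] * 4 for _ in range(4)]
    for y in (2, 3):
        for x in (2, 3):
            curr_frame[y][x] = 1.0
    motion = patch_motion_intensity(prev_frame, curr_frame, patch=2)
    print(motion)
    print(top_k_patches(motion, k=1))
```

In the method described above, such a map is used to construct and guide a graph network over patches rather than to select patches directly; the top-k step here only shows how high-motion patches could be singled out.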