Audio-visual segmentation (AVS) aims to segment the sounding objects in each frame of a given video. Distinguishing sounding objects from silent ones requires both audio-visual semantic correspondence and temporal interaction. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and the visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an Audio-Queried Transformer architecture, AQFormer, in which we define a set of object queries conditioned on audio information and associate each of them with particular sounding objects. Explicit object-level semantic correspondence between the audio and visual modalities is established by gathering object information from visual features with the predefined audio queries. In addition, we propose an Audio-Bridged Temporal Interaction module that exchanges sounding-object-relevant information among multiple frames, using audio features as the bridge. Extensive experiments on two AVS benchmarks show that our method achieves state-of-the-art performance, with gains of 7.1% in M_J and 7.6% in M_F on the MS3 setting.
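The idea of audio-conditioned object queries gathering object information from visual features can be illustrated with a minimal sketch. This is not the AQFormer implementation: the additive conditioning, the single-head attention without learned projections, the dot-product mask prediction, and all function names are assumptions made purely for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_conditioned_queries(base_queries, audio_feature):
    """Assumption: condition each query by adding the audio feature to it."""
    return [[q + a for q, a in zip(query, audio_feature)] for query in base_queries]


def gather_object_features(queries, pixel_features):
    """Single-head cross-attention: each query pools the pixels it matches."""
    dim = len(pixel_features[0])
    scale = 1.0 / math.sqrt(dim)
    gathered = []
    for query in queries:
        weights = softmax([dot(query, p) * scale for p in pixel_features])
        gathered.append(
            [sum(w * p[d] for w, p in zip(weights, pixel_features)) for d in range(dim)]
        )
    return gathered


def predict_masks(object_features, pixel_features):
    """Assumption: mask probability is sigmoid(object embedding . pixel feature)."""
    return [
        [1.0 / (1.0 + math.exp(-dot(obj, p))) for p in pixel_features]
        for obj in object_features
    ]


# Toy frame with four flattened pixels: two "sounding", two "silent".
pixels = [[2.0, -2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, 2.0]]
queries = audio_conditioned_queries([[0.0, 0.0]], audio_feature=[1.0, -1.0])
objects = gather_object_features(queries, pixels)
masks = predict_masks(objects, pixels)
binary = [[prob > 0.5 for prob in mask] for mask in masks]
# binary == [[True, True, False, False]]
```

In this toy setting the audio feature steers the query toward the pixels whose features align with it, so the correspondence is formed per object (one query, one mask) rather than per pixel pair.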