The combination of audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been introduced, which aims to locate and segment the sounding objects in a given video. This task is the first to demand audio-driven, pixel-level scene understanding, which poses significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to the visual features of interest. In addition, we present an audio-visual mixer, which dynamically adjusts visual features by amplifying relevant spatial channels and suppressing irrelevant ones. We also devise an intermediate mask loss to strengthen the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.
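To make the idea behind the audio-visual mixer concrete, the following is a minimal sketch of audio-conditioned channel gating. It is an illustration under stated assumptions, not the AVSegFormer implementation: the function name `channel_gate`, the linear projection `weight`, and the use of a sigmoid gate are all hypothetical choices made here for clarity.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(
    visual: List[List[float]],
    audio: List[float],
    weight: List[List[float]],
) -> List[List[float]]:
    """Rescale each visual channel by a gate computed from the audio embedding.

    visual: C x N feature map (C channels, N flattened spatial positions).
    audio:  length-D audio embedding.
    weight: C x D projection matrix (hypothetical; learned in a real model).
    """
    gated = []
    for c, channel in enumerate(visual):
        # Project the audio embedding to one scalar logit per channel.
        logit = sum(w * a for w, a in zip(weight[c], audio))
        # A gate near 1 keeps the channel; a gate near 0 suppresses it.
        gate = sigmoid(logit)
        gated.append([gate * v for v in channel])
    return gated


# Toy example: the audio embedding favors channel 0 and suppresses channel 1.
features = [[1.0, 2.0], [3.0, 4.0]]
audio_embedding = [1.0, 0.0]
projection = [[10.0, 0.0], [-10.0, 0.0]]
mixed = channel_gate(features, audio_embedding, projection)
```

In this toy setup the first channel passes through almost unchanged while the second is driven close to zero, which is the amplify-or-suppress behavior the abstract describes. The actual mixer may compute its channel weights differently, for example with attention between audio and visual features.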