Large-scale pre-trained models have recently achieved remarkable results on audio-visual downstream tasks. However, because these models are trained primarily on unconstrained single-modality datasets, the features they extract are poorly suited to multi-modal tasks: encoding introduces irrelevant modality-specific information, which degrades downstream performance. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism uses the audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module inserts trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing each encoder to adaptively extract crucial information from its own modality across the spatial, channel, and temporal dimensions while the parameters of the large-scale pre-trained models remain frozen. Experimental evaluations demonstrate that the proposed model achieves state-of-the-art results on multiple downstream tasks, including audio-visual event localization (AVE), audio-visual video parsing (AVVP), audio-visual segmentation (AVS), and audio-visual question answering (AVQA). Furthermore, the model shows promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.
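To make the idea of cross-modal guidance along the spatial, channel, and temporal dimensions concrete, the following is a minimal NumPy sketch of one direction (audio guiding visual features). It is an illustration under stated assumptions, not the DG-SCT implementation: the function name `audio_guided_attention`, the weight names `w_channel` and `w_temporal`, the tensor shapes, and the specific gating formulas are all hypothetical, since the abstract does not specify them. The frozen encoder is represented only by its output features, which are never modified in place; only the two small projection matrices would be trained.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def audio_guided_attention(visual, audio, w_channel, w_temporal):
    """Hypothetical audio-guided attention over frozen visual features.

    visual:     (T, N, C) features from a frozen visual encoder
                (T frames, N spatial patches, C channels).
    audio:      (T, D) features from a frozen audio encoder.
    w_channel:  (D, C) trainable projection (assumed name).
    w_temporal: (D,)   trainable projection (assumed name).
    """
    # The audio features act as a prompt in the visual channel space.
    prompt = audio @ w_channel                       # (T, C)

    # Channel attention: one gate per frame and channel.
    channel_gate = sigmoid(prompt)                   # (T, C)
    gated = visual * channel_gate[:, None, :]        # (T, N, C)

    # Spatial attention: score each patch against the prompt, softmax over patches.
    scores = np.einsum("tnc,tc->tn", gated, prompt) / np.sqrt(visual.shape[-1])
    spatial_attn = softmax(scores, axis=1)           # (T, N)

    # Temporal attention: one gate per frame.
    temporal_gate = sigmoid(audio @ w_temporal)      # (T,)

    # Residual update: the frozen features pass through unchanged,
    # and the cross-modal term is added on top.
    weight = spatial_attn * temporal_gate[:, None]   # (T, N)
    out = visual + gated * weight[:, :, None]        # (T, N, C)
    return out, spatial_attn, temporal_gate


# Usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 9, 16))    # 4 frames, 9 patches, 16 channels
audio = rng.normal(size=(4, 8))         # 4 frames, 8 audio dimensions
w_channel = rng.normal(size=(8, 16)) * 0.1
w_temporal = rng.normal(size=(8,)) * 0.1
out, spatial_attn, temporal_gate = audio_guided_attention(
    visual, audio, w_channel, w_temporal
)
print(out.shape, spatial_attn.shape, temporal_gate.shape)
```

In this sketch the "dual" guidance would be obtained by calling an analogous function a second time with the roles of the two modalities swapped, so that visual features also prompt the audio branch. The residual form is a design choice of the sketch: with small trainable weights the output stays close to the frozen features, which is one common way to add adapters without disturbing a pre-trained backbone.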