Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

In recent years, large-scale pre-trained models have yielded remarkable results on audio-visual downstream tasks. However, these models are trained primarily on single-modality, unconstrained datasets, so their encoders introduce irrelevant modality-specific information when extracting features for multi-modal tasks, which leads to suboptimal downstream performance. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while keeping the parameters of the large-scale pre-trained models frozen. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including audio-visual event localization (AVE), audio-visual video parsing (AVVP), audio-visual segmentation (AVS), and audio-visual question answering (AVQA). Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.
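
To make the idea of one modality acting as a prompt for the other more concrete, the following is a minimal NumPy sketch of audio-guided attention over frozen visual features along the channel, spatial, and temporal dimensions. It is an illustration under stated assumptions, not the authors' implementation (see their repository for that): the function name `audio_guided_attention`, the weights `w_channel` and `w_temporal`, the gating formulas, and the residual connection are all hypothetical choices, and only the audio-to-visual direction of the dual guidance is shown.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis):
    # Subtract the max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def audio_guided_attention(visual, audio, w_channel, w_temporal):
    """Hypothetical sketch of audio-guided attention on visual features.

    visual:     (T, N, C) patch features from a frozen visual encoder
    audio:      (T, C)    per-frame audio features used as the prompt
    w_channel:  (C, C)    assumed trainable projection for the channel gate
    w_temporal: (C,)      assumed trainable vector for the temporal gate
    """
    # Channel: the audio prompt decides which visual channels to emphasize.
    channel_gate = sigmoid(audio @ w_channel)              # (T, C)
    v = visual * channel_gate[:, None, :]                  # (T, N, C)

    # Spatial: weight each patch by its similarity to the audio prompt.
    scores = np.einsum("tnc,tc->tn", v, audio) / np.sqrt(v.shape[-1])
    spatial = softmax(scores, axis=1)                      # (T, N)
    v = v * spatial[:, :, None]

    # Temporal: one scalar gate per frame from pooled audio-visual agreement.
    pooled = v.sum(axis=1)                                 # (T, C)
    temporal_gate = sigmoid((pooled * audio) @ w_temporal) # (T,)
    v = v * temporal_gate[:, None, None]

    # Residual connection: the frozen features pass through unchanged,
    # and only the small set of attention weights would be trained.
    return visual + v, spatial, temporal_gate


rng = np.random.default_rng(0)
T, N, C = 4, 9, 8  # frames, patches per frame, channels (toy sizes)
out, spatial, temporal_gate = audio_guided_attention(
    rng.normal(size=(T, N, C)),
    rng.normal(size=(T, C)),
    rng.normal(size=(C, C)) * 0.1,
    rng.normal(size=(C,)) * 0.1,
)
print(out.shape, spatial.shape, temporal_gate.shape)
```

In this sketch the encoder output `visual` is never modified in place; only `w_channel` and `w_temporal` would receive gradients in a trainable version, which mirrors the abstract's point that the pre-trained parameters stay frozen. Swapping the roles of `visual` and `audio` (with suitably shaped inputs) would give the other direction of guidance.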
