The pretrain-then-finetune paradigm is widely used in unimodal and multimodal tasks. However, finetuning all the parameters of a pre-trained model becomes prohibitive as model size grows exponentially. To address this issue, the adapter mechanism, which freezes the pre-trained model and finetunes only a few extra parameters, has been introduced and delivers promising results. Most studies on adapter architectures are dedicated to unimodal or bimodal tasks, while adapter architectures for trimodal tasks have not yet been investigated. This paper introduces a novel Long Short-Term Trimodal Adapter (LSTTA) approach for video understanding tasks involving the audio, visual, and language modalities. Building on pre-trained models for the three modalities, the designed adapter module is inserted between their sequential blocks to model the dense interactions across the three modalities. Specifically, LSTTA consists of two complementary types of adapter modules: the long-term semantic filtering module and the short-term semantic interaction module. The long-term semantic filtering module characterizes the temporal importance of the video frames, while the short-term semantic interaction module models local interactions within short periods. Compared to previous state-of-the-art trimodal learning methods pre-trained on a large-scale trimodal corpus, LSTTA is more flexible and can inherit any powerful unimodal or bimodal model. Experimental results on four typical trimodal learning tasks show the effectiveness of LSTTA over existing state-of-the-art methods.
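To make the cost argument behind adapters concrete, the following is a minimal sketch that counts trainable versus frozen parameters. It assumes a generic bottleneck adapter (a down-projection followed by an up-projection) attached to each Transformer block. This is an illustrative stand-in, not the LSTTA modules described in the abstract, and the function names, the model width of 768, the bottleneck width of 64, and the 12-block depth are all hypothetical choices.

```python
def adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameters of one generic bottleneck adapter (assumed design, not LSTTA's)."""
    down = d_model * bottleneck + bottleneck  # down-projection weights + bias
    up = bottleneck * d_model + d_model       # up-projection weights + bias
    return down + up


def transformer_block_params(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate weight count of one Transformer block.

    Ignores biases and layer-norm parameters, so this is a rough estimate.
    """
    attention = 4 * d_model * d_model           # Q, K, V and output projections
    ffn = 2 * ffn_mult * d_model * d_model      # two feed-forward matrices
    return attention + ffn


def trainable_fraction(d_model: int, bottleneck: int, num_blocks: int) -> float:
    """Share of parameters that are trained when the backbone stays frozen."""
    frozen = num_blocks * transformer_block_params(d_model)
    trainable = num_blocks * adapter_params(d_model, bottleneck)
    return trainable / (frozen + trainable)


if __name__ == "__main__":
    # Hypothetical sizes: width 768, bottleneck 64, 12 blocks.
    fraction = trainable_fraction(d_model=768, bottleneck=64, num_blocks=12)
    print(f"trainable share of all parameters: {fraction:.2%}")
```

Under these assumed sizes, the adapters account for under 2% of all parameters, which is the kind of saving that motivates freezing the backbone and training only the inserted modules.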