In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in various natural language processing and computer vision tasks. The growing size of LMMs, however, makes fine-tuning them for downstream tasks computationally expensive. Hence, prompt-based interaction strategies have been studied as a more efficient way to align modalities. In this context, we propose a novel prompt-based multimodal interaction strategy inspired by human memory strategy, namely Memory-Inspired Temporal Prompt Interaction (MITP). Our method involves two stages, mirroring human memory strategy: the acquiring stage, and the consolidation and activation stage. We utilize temporal prompts on intermediate layers to imitate the acquiring stage, leverage similarity-based prompt interaction to imitate memory consolidation, and employ a prompt generation strategy to imitate memory activation. The main strength of our method is that the prompt vectors interact on intermediate layers, enabling sufficient information exchange between modalities while keeping the number of trainable parameters and the memory usage small. We achieve competitive results on several datasets with relatively small memory usage and 2.0M trainable parameters (about 1% of the pre-trained foundation model).