Deep multi-modal semantic understanding, which goes beyond mining superficial content relations, has received increasing attention in artificial intelligence. The difficulty of collecting and annotating high-quality multi-modal data underscores the importance of few-shot learning. In this paper, we focus on two critical tasks in this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on a unified vision-language model (VLM). Specifically, we design three soft prompt experts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representations, and a unified prompt that assists multi-modal interaction. Additionally, we reorganize the Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smooths the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in the few-shot setting, our proposed model not only surpasses the 8.2B-parameter InstructBLIP with merely 2% of its parameters (150M), but also significantly outperforms other widely used prompt methods for VLMs as well as task-specific methods.
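To make the block-aware fusion idea more concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes that the Transformer layers are split into contiguous blocks and that, at a block boundary, the prompt vectors of one modality attend over the prompt vectors of the other modality with scaled dot-product attention and a residual connection. The function names `split_into_blocks` and `cross_prompt_attention`, the absence of learned projections, and the residual update are all assumptions made for illustration; the abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_into_blocks(num_layers, num_blocks):
    """Partition layer indices 0..num_layers-1 into contiguous blocks.

    Hypothetical helper: earlier blocks receive one extra layer when the
    layers do not divide evenly.
    """
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def cross_prompt_attention(queries, keys_values):
    """Let each prompt vector in `queries` attend over `keys_values`.

    Assumed form: scaled dot-product attention without learned
    projections, followed by a residual addition to the query prompt.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        context = [
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(dim)
        ]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


# Toy usage: 12 layers grouped into 3 blocks; text prompts attend to image prompts.
blocks = split_into_blocks(12, 3)
text_prompt = [[1.0, 0.0], [0.0, 1.0]]
image_prompt = [[0.5, 0.5]]
fused_text_prompt = cross_prompt_attention(text_prompt, image_prompt)
```

In this toy setup the attention would be applied once at each boundary between adjacent blocks, so the text and image prompts exchange information gradually rather than only at the final layer. A real model would use learned query, key, and value projections and tensor operations instead of Python lists.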