The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining systems. Existing approaches rely on full fine-tuning of large models, making them ill-suited to resource-constrained settings. While recent parameter-efficient fine-tuning (PEFT) methods offer promise, their off-the-shelf use underperforms on complex tasks such as sarcasm detection. We propose AdS-CLIP (Adapter-state Sharing in CLIP), a lightweight framework built on CLIP that inserts adapters only in the upper layers, preserving low-level unimodal representations in the lower layers, and introduces a novel adapter-state sharing mechanism in which textual adapters guide visual ones to promote efficient cross-modal learning in the upper layers. Experiments on two public benchmarks demonstrate that AdS-CLIP outperforms not only standard PEFT methods but also existing multimodal baselines, with significantly fewer trainable parameters.
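Since the abstract only names the mechanism, the minimal PyTorch sketch below illustrates one plausible reading of adapter-state sharing between the text and vision branches; the bottleneck size, the mean-pooling of the textual adapter state, and the 50/50 blending rule are illustrative assumptions, not the paper's actual design, and the frozen CLIP backbone around the adapters is omitted.

```python
from typing import Optional

import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, GELU, up-project, residual connection."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor, shared_state: Optional[torch.Tensor] = None):
        # Bottleneck ("adapter") state; this is what gets shared across modalities.
        state = self.act(self.down(x))
        if shared_state is not None:
            # Hypothetical sharing rule: blend in the textual adapter's state so
            # language features guide the visual branch.
            state = 0.5 * state + 0.5 * shared_state
        return x + self.up(state), state


class SharedAdapterBlock(nn.Module):
    """One upper-layer block: the text adapter runs first and its pooled
    bottleneck state is handed to the visual adapter (text guides vision)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.text_adapter = Adapter(dim, bottleneck)
        self.visual_adapter = Adapter(dim, bottleneck)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor):
        text_out, text_state = self.text_adapter(text_tokens)   # state: (B, T, b)
        guide = text_state.mean(dim=1, keepdim=True)             # (B, 1, b), broadcasts over patches
        image_out, _ = self.visual_adapter(image_patches, shared_state=guide)
        return text_out, image_out


if __name__ == "__main__":
    block = SharedAdapterBlock(dim=512)
    text = torch.randn(2, 77, 512)    # toy text token features
    image = torch.randn(2, 50, 512)   # toy image patch features
    t_out, v_out = block(text, image)
    print(t_out.shape, v_out.shape)   # torch.Size([2, 77, 512]) torch.Size([2, 50, 512])
```

In this sketch only the adapter parameters would be trained, which is what keeps the method parameter-efficient; such a block would sit only in the upper transformer layers, leaving the lower layers untouched.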