Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification

Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While the understanding of linguistic metaphors has advanced significantly, multimodal metaphors, such as those found in internet memes, pose unique challenges because of their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Generative approaches that rely on large language models or text-to-image models are promising but incur high computational costs. This paper introduces \textbf{C}oncept \textbf{D}rift \textbf{G}uided \textbf{L}ayerNorm \textbf{T}uning (\textbf{CDGLT}), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that applies Spherical Linear Interpolation (SLERP) to cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps narrow the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts feature extraction and fusion with pre-trained language models to the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LayerNorm (LN) Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at \href{https://github.com/Qianvenh/CDGLT}{https://github.com/Qianvenh/CDGLT}.
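
To make the SLERP operation behind Concept Drift concrete, the following is a minimal sketch of spherical linear interpolation between two embedding vectors. It is not the authors' implementation: the function name `slerp`, the interpolation weight `t`, the 512-dimensional size, and the random vectors standing in for CLIP image and text embeddings are all assumptions made for illustration.

```python
import numpy as np


def slerp(a, b, t, eps=1e-7):
    """Spherical linear interpolation between vectors a and b.

    Both inputs are projected onto the unit sphere first; t=0 returns
    the direction of a, t=1 the direction of b.
    """
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Angle between the two unit vectors (clipped for numerical safety).
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Degenerate case (parallel or antiparallel): no unique arc, return a.
        return a
    w_a = np.sin((1.0 - t) * omega) / sin_omega
    w_b = np.sin(t * omega) / sin_omega
    return w_a * a + w_b * b


# Hypothetical stand-ins for CLIP image and text embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_emb = rng.normal(size=512)

# t = 0.5 is an arbitrary illustrative weight, not a value from the paper.
drifted = slerp(image_emb, text_emb, t=0.5)
```

Unlike plain linear interpolation, the result stays on the unit sphere, which matters when downstream components compare embeddings by cosine similarity, as is common with CLIP features.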
