Memes are a dominant medium for online communication and manipulation: their meaning emerges from the interplay of embedded text, imagery, and cultural context. Existing meme research is fragmented across tasks (hate, misogyny, propaganda, sentiment, humour) and languages, which limits cross-domain generalization. To address this gap, we propose MemeLens, a unified, multilingual, and multitask explanation-enhanced Vision Language Model (VLM) for meme understanding. We consolidate 38 public meme datasets, filtering and mapping dataset-specific labels into a shared taxonomy of 20 tasks spanning harm, targets, figurative/pragmatic intent, and affect. We present a comprehensive empirical analysis across modeling paradigms, task categories, and datasets. Our findings suggest that robust meme understanding requires multimodal training, varies substantially across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in a unified setting. We will make the experimental resources and datasets publicly available to the community.
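The label-consolidation step can be pictured as a lookup from (dataset, original label) pairs to (shared task, canonical label) pairs in the unified taxonomy. Below is a minimal Python sketch of that idea under a simple lookup-table design; all dataset names, task names, and label strings are hypothetical placeholders, not the actual mapping used by MemeLens.

```python
# Hypothetical sketch: mapping dataset-specific labels into a shared
# taxonomy. Every dataset/task/label string here is illustrative only.
from typing import Dict, Tuple

# (dataset, original_label) -> (shared_task, canonical_label)
LABEL_MAP: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("hateful_memes", "hateful"):     ("harm_detection", "harmful"),
    ("hateful_memes", "not-hateful"): ("harm_detection", "not_harmful"),
    ("mami", "misogynous"):           ("target_identification", "women"),
    ("memotion", "funny"):            ("affect", "humour"),
}

def to_shared_taxonomy(dataset: str, label: str) -> Tuple[str, str]:
    """Map a dataset-specific label to a (task, canonical_label) pair
    in the shared taxonomy; labels filtered out of the taxonomy have
    no entry and raise a KeyError."""
    try:
        return LABEL_MAP[(dataset, label)]
    except KeyError:
        raise KeyError(f"({dataset!r}, {label!r}) not in shared taxonomy")

if __name__ == "__main__":
    print(to_shared_taxonomy("hateful_memes", "hateful"))
    # -> ('harm_detection', 'harmful')
```

A plain lookup table makes the filtering step explicit: any (dataset, label) pair left out of the table is excluded from the unified training set by construction.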