Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales obtained via distillation and with fine-grained propaganda annotations produced by multiple LLMs, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes (FHM) and ArMeme benchmarks show that our approach improves over previously reported results in FHM accuracy (by up to 2.1 percentage points, from 79.9% to 82.0%) and in ArMeme macro-F1 (by up to 7.6 points, from 0.536 to 0.612 with explanations; +6.1 points compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.
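
As a rough illustration of two ingredients named above, the following minimal Python sketch shows a group-relative advantage computation in the style of GRPO, a reward with a thinking-length penalty, and majority-vote pseudo-labeling. It is not the implementation used in this work: the function names, the reward form, and the parameters `max_think`, `length_weight`, and `min_agreement` are hypothetical choices made for illustration, and the explanation-quality reward term is omitted.

```python
from collections import Counter
from statistics import mean, pstdev


def reward(pred, gold, n_think_tokens, max_think=256, length_weight=0.1):
    """Hypothetical reward: 1 for a correct label, minus a penalty that
    grows with the number of thinking tokens beyond max_think."""
    correct = 1.0 if pred == gold else 0.0
    overflow = max(0, n_think_tokens - max_think) / max_think
    return correct - length_weight * overflow


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward against the mean and
    standard deviation of its own group of sampled responses.
    (Population standard deviation is an assumption of this sketch.)"""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def consensus_pseudo_label(preds, min_agreement=0.6):
    """Return the majority label among sampled predictions, or None when
    the share of votes for it is below min_agreement."""
    label, count = Counter(preds).most_common(1)[0]
    return label if count / len(preds) >= min_agreement else None


if __name__ == "__main__":
    # Four sampled responses for one unlabeled meme: (predicted label, thinking tokens).
    samples = [("hateful", 120), ("hateful", 400), ("not_hateful", 90), ("hateful", 200)]
    pseudo = consensus_pseudo_label([label for label, _ in samples])
    if pseudo is not None:
        rewards = [reward(label, pseudo, n) for label, n in samples]
        print(pseudo, group_relative_advantages(rewards))
```

In this sketch, a group whose responses all receive the same reward yields zero advantages, so such a prompt contributes no learning signal; a meme without sufficient agreement among samples is skipped rather than assigned a pseudo-label.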