Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves both classification performance and reference-based explanation quality in thinking-based MLLMs, using task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales obtained via distillation and with multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes (FHM) and ArMeme benchmarks show that our approach improves over previously reported results: FHM accuracy rises by up to 2.1 percentage points (from 79.9% to 82.0%), and ArMeme macro-F1 rises by up to 7.6 points (from 0.536 to 0.612 with explanations; +6.1 points compared to the original ArMeme benchmark). The model also generates natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.
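
To make contributions (iii) and (iv) more concrete, the following is a minimal sketch, not the released implementation. It assumes a scalar reward that adds a correctness term, a weighted explanation-quality score, and a penalty on thinking tokens beyond a budget. It then applies the standard GRPO normalization of rewards within a group of sampled responses, and a simple majority vote for consensus pseudo-labels. The function names, the weights `alpha` and `beta`, the budget `target_len`, and the threshold `min_agreement` are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import mean, pstdev


def reward(pred, gold, explanation_score, think_len,
           target_len=256, alpha=0.5, beta=0.1):
    """Hypothetical scalar reward for one sampled response.

    explanation_score is assumed to be a reference-based score in [0, 1];
    alpha, beta and target_len are illustrative values, not the paper's.
    """
    accuracy_term = 1.0 if pred == gold else 0.0
    # Penalize only the thinking tokens that exceed the assumed budget.
    length_penalty = max(0, think_len - target_len) / target_len
    return accuracy_term + alpha * explanation_score - beta * length_penalty


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def consensus_label(predictions, min_agreement=0.6):
    """Majority-vote pseudo-label, or None when agreement is too low."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count / len(predictions) >= min_agreement else None
```

Under these assumptions, a group in which every response earns the same reward yields zero advantages and therefore no learning signal, and an unlabeled meme is skipped whenever `consensus_label` returns `None`.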
