Hateful meme detection remains a formidable challenge for vision-language models (VLMs). Existing benchmarks are structurally observational: they confound rhetorical hate mechanisms with target-community features, which prevents causal evaluation of model vulnerabilities. To address this, we introduce FBHM (Functionality-Based Hateful Memes), a systematically curated benchmark constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes in total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models that are highly accurate on standard datasets drop to near-random performance on FBHM, indicating that they exploit dataset-specific heuristics rather than robust multimodal reasoning. To close this gap efficiently, we propose learnable steering vectors (LSV), a strategy for the ultra-low-data regime that applies a causal intervention objective to as few as 500 steering samples (50 unique base memes). LSV improves FBHM performance by roughly 30 Macro-F1 points, outperforms in-context learning and parameter-efficient fine-tuning (PEFT), and does not degrade source-domain performance.
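For readers unfamiliar with the metric behind the reported gain, Macro-F1 is the unweighted mean of per-class F1 scores, so "points" refers to differences on a 0 to 100 scale. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `macro_f1` and the toy labels are illustrative assumptions and do not come from FBHM.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard Macro-F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, hypothetical labels (1 = hateful, 0 = benign); not data from the benchmark.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
chance_like = [1, 1, 0, 0, 1, 1, 0, 0]   # half of each class misclassified
improved = [1, 1, 1, 0, 1, 0, 0, 0]      # one error per class

print(round(100 * macro_f1(y_true, chance_like), 1))  # 50.0
print(round(100 * macro_f1(y_true, improved), 1))     # 75.0
```

On a balanced binary task, a classifier that is right half the time in each class scores about 50 Macro-F1, which is what "near-random" corresponds to under this metric.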