Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.
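The non-parametric classifier described above retrieves the k nearest training embeddings and predicts by majority vote. A minimal sketch of that idea follows, using cosine similarity over L2-normalised vectors; NumPy stands in for FAISS here, and the function and variable names are illustrative, not from the paper's codebase.

```python
import numpy as np

def knn_classify(query_emb, bank_embs, bank_labels, k=5):
    """Predict a label for one multimodal embedding by majority
    vote over its k nearest neighbours (cosine similarity).

    The paper performs this search with a FAISS index; plain
    NumPy is used here to keep the sketch self-contained.
    """
    # L2-normalise so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q                      # similarity to every stored sample
    top = np.argsort(-sims)[:k]       # indices of the k most similar
    votes = bank_labels[top]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]  # majority-vote label
```

Because prediction depends only on similarity to stored examples, rare classes remain reachable whenever a few semantically close samples exist in the bank, which is the robustness property the abstract attributes to this classifier.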