Multimodal hateful content detection is a challenging task that requires complex reasoning across visual and textual modalities. Creating a multimodal representation that captures the interplay between visual and textual features through intermediate fusion is therefore critical. Conventional fusion techniques cannot attend effectively to modality-specific features. Moreover, most studies have concentrated exclusively on English, overlooking low-resource languages. This paper proposes a context-aware attention framework for multimodal hateful content detection and assesses it on both English and non-English data. The proposed approach incorporates an attention layer that aligns the visual and textual features, which enables selective focus on modality-specific features before they are fused. We evaluate the approach on two benchmark hateful meme datasets, viz. MUTE (Bengali code-mixed) and MultiOFF (English). It achieves F1-scores of $69.7$% and $70.3$% on MUTE and MultiOFF, respectively, improvements of approximately $2.5$% and $3.2$% over the state-of-the-art systems on these datasets. Our implementation is available at https://github.com/eftekhar-hossain/Bengali-Hateful-Memes.
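To make the idea of attention-based alignment before fusion concrete, the following is a minimal sketch, not the implementation from the repository above. It assumes scaled dot-product attention with no learned projections, mean pooling, and concatenation as the fusion step; the function names (`attend`, `mean_pool`, `fuse`) and the toy feature values are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys_values):
    """Scaled dot-product attention (assumed form, no learned weights).

    Each query vector is replaced by a weighted average of the
    vectors in keys_values, weighted by query-key similarity.
    """
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        context = [
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(dim)
        ]
        attended.append(context)
    return attended


def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def fuse(text_feats, image_feats):
    """Align each modality against the other, then concatenate.

    Concatenation is an assumption; the source only states that
    features are aligned by attention before being fused.
    """
    text_ctx = attend(text_feats, image_feats)   # text tokens attend to image regions
    image_ctx = attend(image_feats, text_feats)  # image regions attend to text tokens
    return mean_pool(text_ctx) + mean_pool(image_ctx)


# Toy features: 3 text tokens and 2 image regions, each 4-dimensional.
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
image_feats = [[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

fused = fuse(text_feats, image_feats)
print(len(fused))  # 8: two pooled 4-dimensional vectors concatenated
```

In a trained model the queries, keys, and values would typically pass through learned linear projections, and the fused vector would feed a classifier; both are omitted here to keep the alignment step visible.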