Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework

Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting such content is challenging because it is often satirical, subtle, and culturally specific. The problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Notably, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that jointly analyzes the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness on this nuanced task. To facilitate reproducibility and future research, the Bn-HIB dataset has been made publicly available through Mendeley Data. Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.
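The abstract does not spell out MCFM's layers, so the following is only a minimal sketch of what a generic co-attention fusion step could look like, not the authors' implementation. It assumes image-region and text-token features have already been extracted and projected to a shared dimension; the function names, the feature sizes, the mean pooling, and the random features and weights are all hypothetical stand-ins. Only the three class labels come from the abstract.

```python
import numpy as np

# Class labels as named in the abstract.
LABELS = ("Benign", "Hate", "Inflammatory")


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def co_attention_fuse(img_feats, txt_feats):
    """Fuse two modalities with a simple (hypothetical) co-attention step.

    img_feats: (n_regions, d) image-region features
    txt_feats: (n_tokens, d) text-token features
    Returns a fused vector of shape (2 * d,).
    """
    d = img_feats.shape[1]
    # Affinity between every image region and every text token.
    affinity = img_feats @ txt_feats.T / np.sqrt(d)  # (n_regions, n_tokens)
    # Each image region attends over the text tokens ...
    img_ctx = softmax(affinity, axis=1) @ txt_feats  # (n_regions, d)
    # ... and each text token attends over the image regions.
    txt_ctx = softmax(affinity.T, axis=1) @ img_feats  # (n_tokens, d)
    # Mean-pool each attended sequence and concatenate (an assumed choice).
    return np.concatenate([img_ctx.mean(axis=0), txt_ctx.mean(axis=0)])


def classify(fused, W, b):
    """Linear classifier head over the fused vector."""
    probs = softmax(fused @ W + b)
    return LABELS[int(np.argmax(probs))], probs


# Random features stand in for real encoder outputs (assumption).
rng = np.random.default_rng(0)
d = 8
img = rng.normal(size=(5, d))   # 5 hypothetical image regions
txt = rng.normal(size=(7, d))   # 7 hypothetical text tokens
W = rng.normal(size=(2 * d, len(LABELS)))  # untrained, illustrative weights
b = np.zeros(len(LABELS))

fused = co_attention_fuse(img, txt)
label, probs = classify(fused, W, b)
print(label, probs)
```

Because the weights here are random and untrained, the printed label carries no meaning; the sketch only illustrates how each modality can weight the other's features before the two are fused for a three-way decision.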

