A title is clickbait when it intentionally lures readers into clicking on a piece of content by exploiting their curiosity. Although several studies have focused on detecting clickbait titles in English articles, low-resource languages such as Bangla have not received adequate attention. To tackle clickbait titles in Bangla, we have constructed the first Bangla clickbait detection dataset, containing 15,056 labeled news articles and 65,406 unlabeled news articles extracted from clickbait-dense news sites. Each article has been labeled by three expert linguists and includes the article's title, body, and other metadata. By incorporating both labeled and unlabeled data, we fine-tune a pretrained Bangla transformer model in an adversarial fashion using Semi-Supervised Generative Adversarial Networks (SS-GANs). The proposed model serves as a good baseline for this dataset, outperforming traditional neural network models (LSTM, GRU, CNN) and linguistic-feature-based models. We expect that this dataset, together with the detailed analysis and comparison of these clickbait detection models, will provide a fundamental basis for future research into detecting clickbait titles in Bangla articles. We have released the corresponding code and dataset.
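To make the semi-supervised adversarial objective concrete, the following is a minimal NumPy sketch of the losses commonly used in SS-GAN training. It assumes the standard formulation in which the discriminator outputs K+1 logits (here K = 2: clickbait and non-clickbait, plus one extra "fake" class) and the generator is trained with an adversarial term plus feature matching. The abstract does not specify the exact losses, so the function names, the label convention (-1 for unlabeled titles), and the choice of feature matching are all illustrative assumptions, not the released implementation. In practice the logits and features would come from the fine-tuned transformer encoder.

```python
import numpy as np

EPS = 1e-8  # numerical guard inside logarithms


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def discriminator_loss(real_logits, labels, fake_logits):
    """Assumed SS-GAN discriminator loss.

    real_logits: (N, K+1) logits for real titles (labeled and unlabeled).
    labels:      (N,) class ids in [0, K), or -1 for unlabeled titles.
    fake_logits: (M, K+1) logits for generated representations.
    The last column is the "fake" class.
    """
    # Supervised term: cross-entropy over the K real classes, labeled rows only.
    labeled = labels >= 0
    if labeled.any():
        class_log_probs = log_softmax(real_logits[labeled, :-1])
        rows = np.arange(int(labeled.sum()))
        supervised = -class_log_probs[rows, labels[labeled]].mean()
    else:
        supervised = 0.0

    # Unsupervised terms: real titles should not look fake, generated ones should.
    p_fake_given_real = np.exp(log_softmax(real_logits)[:, -1])
    unsup_real = -np.log(1.0 - p_fake_given_real + EPS).mean()
    unsup_fake = -log_softmax(fake_logits)[:, -1].mean()
    return supervised + unsup_real + unsup_fake


def generator_loss(fake_logits, real_features, fake_features):
    """Assumed generator loss: fool the discriminator and match mean features."""
    p_fake = np.exp(log_softmax(fake_logits)[:, -1])
    adversarial = -np.log(1.0 - p_fake + EPS).mean()
    feature_matching = np.mean(
        (real_features.mean(axis=0) - fake_features.mean(axis=0)) ** 2
    )
    return adversarial + feature_matching


# Toy batch: two labeled titles (classes 0 and 1) and one unlabeled title.
real_logits = np.array([[5.0, 0.0, -5.0], [0.0, 5.0, -5.0], [3.0, 0.0, -5.0]])
labels = np.array([0, 1, -1])
fake_logits = np.array([[-5.0, -5.0, 5.0]])
d_loss = discriminator_loss(real_logits, labels, fake_logits)
```

The unlabeled titles contribute only to the real-versus-fake terms, which is how the 65,406 unlabeled articles could inform the classifier without class labels under this formulation.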