With the COVID-19 pandemic, related fake news is spreading widely across social media. Believing it indiscriminately can seriously disrupt people's lives. However, general-purpose language models may perform poorly at detecting such fake news, because large-scale annotated data are lacking and the models have insufficient semantic understanding of domain-specific knowledge. Models trained on the corresponding corpora also perform only moderately, because of insufficient learning. In this paper, we propose a novel transformer-based language model fine-tuning approach for COVID-19 fake news detection. First, the token vocabulary of each individual model is expanded to capture the actual semantics of professional phrases. Second, we adapt the heated-up softmax loss to distinguish hard-to-classify samples, which are common in fake news because of the ambiguity of short text. Third, we apply adversarial training to improve the model's robustness. Finally, the predicted features extracted by the general-purpose language model RoBERTa and the domain-specific model CT-BERT are fused by a multilayer perceptron to integrate fine-grained and high-level specific representations. Quantitative experiments on an existing COVID-19 fake news dataset show that the approach outperforms state-of-the-art methods across various evaluation metrics, with a best weighted average F1 score of 99.02%.
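To make the heated-up softmax step more concrete, the following is a minimal sketch, assuming the loss can be read as a cross-entropy over logits multiplied by a temperature-like scale factor. The abstract does not give the exact formulation, so the names `heated_softmax`, `heated_cross_entropy`, and `alpha`, as well as the example logits and scale values, are illustrative assumptions rather than the authors' implementation.

```python
import math


def heated_softmax(logits, alpha):
    """Softmax over logits multiplied by a scale factor alpha.

    alpha is an assumed, hypothetical name for the temperature-like
    parameter; larger alpha gives a sharper distribution.
    """
    scaled = [alpha * z for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def heated_cross_entropy(logits, target, alpha):
    """Negative log-likelihood of the target class under heated_softmax."""
    probs = heated_softmax(logits, alpha)
    return -math.log(probs[target])


if __name__ == "__main__":
    # Hypothetical two-class scores (index 0 = real, index 1 = fake).
    logits = [2.0, 1.0]
    for alpha in (4.0, 1.0, 0.25):
        probs = heated_softmax(logits, alpha)
        loss = heated_cross_entropy(logits, 0, alpha)
        print(f"alpha={alpha}: probs={probs}, loss={loss}")
```

Under this reading, a large scale factor drives the loss of an already correctly classified sample toward zero, so training signal concentrates on samples near or across the decision boundary, while a small scale factor flattens the distribution so that every sample contributes. How the scale factor is scheduled during fine-tuning is not stated in the abstract and is left open here.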