Ax-to-Grind Urdu: Benchmark Dataset for Urdu Fake News Detection

Misinformation can seriously harm society, affecting everything from public opinion to institutional confidence and a state's political landscape. The proliferation of fake news (FN) on websites and online social networks (OSNs) has increased sharply. Most fact-checking websites cover news in English and provide little information about FN in regional languages, so purveyors of Urdu FN cannot be identified through fact-checking portals. State-of-the-art (SOTA) approaches to fake news detection (FND) depend on large, appropriately labelled datasets. FND in regional and resource-constrained languages lags because of the lack of sufficiently large datasets and reliable lexical resources. Previous datasets for Urdu FND are small, domain-restricted, not publicly available, and not manually verified, with news items translated from English into Urdu. In this paper, we curate and contribute the largest publicly available dataset for Urdu FND, Ax-to-Grind Urdu, to bridge the identified gaps and limitations of existing Urdu datasets in the literature. It comprises 10,083 fake and real news items across fifteen domains, collected from leading and authentic Urdu newspapers and news channel websites in Pakistan and India. FN for the Ax-to-Grind dataset is collected from websites and through crowdsourcing. The dataset contains news items in Urdu from 2017 to 2023, annotated by expert journalists. We benchmark the dataset with an ensemble model of mBERT, XLNet, and XLM-RoBERTa. The selected models are pretrained on large multilingual corpora. We report results for the proposed model using F1-score, accuracy, precision, recall, and the Matthews correlation coefficient (MCC).
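The five metrics named in the abstract can all be derived from the confusion counts of a binary fake/real classifier. The following is a minimal sketch, not the authors' evaluation code: the function name `binary_metrics`, the label convention (1 = fake, 0 = real), and the toy labels are assumptions made purely for illustration.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and MCC for 0/1 labels.

    Assumed convention (not from the paper): 1 = fake news, 0 = real news.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fake, flagged fake
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # real, flagged real
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # real, flagged fake
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fake, flagged real

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # MCC uses all four cells, so it stays informative on imbalanced classes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not from the dataset).
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    for name, value in binary_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

On these toy labels accuracy, precision, recall, and F1 all equal 0.75 while MCC is 0.5, which illustrates why MCC is often reported alongside the other four: it penalises both kinds of error at once.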

