Despite growing awareness of and research on fake news, there remains a significant need for datasets that specifically target racial slurs and biases in North American political speeches. This is particularly important in the context of upcoming North American elections. This study introduces a comprehensive dataset that illuminates these critical aspects of misinformation. To develop this fake news dataset, we scraped and built a corpus of 40,000 news articles about political discourse in North America. A portion of this dataset (4,000 articles) was then carefully annotated using a blend of advanced language models and human verification. We have made both datasets openly available to the research community and have benchmarked the annotated data to demonstrate its utility. We also release the best-performing language model along with the data. We encourage researchers and developers to use this dataset and contribute to this ongoing initiative.