Misinformation, propaganda, and outright lies proliferate on the web, and some narratives have dangerous real-world consequences for public health, elections, and individual safety. Despite this impact, the research community largely lacks automated, programmatic approaches for tracking news narratives across online platforms. In this work, using daily scrapes of 1,334 unreliable news websites, the large language model MPNet, and DP-Means clustering, we introduce a system that automatically identifies and tracks the narratives spread within online ecosystems. Having identified 52,036 narratives on these 1,334 websites, we describe the most prevalent narratives spread in 2022 and identify the most influential websites that originate and amplify narratives. Finally, we show how our system can be used to detect new narratives originating from unreliable news websites and to help fact-checkers address misinformation more quickly. We release code and data at https://github.com/hanshanley/specious-sites.
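To make the clustering step concrete, the following is a minimal sketch of DP-Means applied to passage embeddings. It is an illustration under stated assumptions, not the released implementation: the three-dimensional toy vectors stand in for MPNet embeddings, the threshold `lam = 0.3` is a hypothetical value, and the helper names (`normalize`, `sq_dist`, `mean`, `dp_means`) are chosen here for readability. Vectors are scaled to unit length so that squared Euclidean distance is a monotone function of cosine similarity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dp_means(points, lam, max_iter=100):
    """Cluster points with DP-Means.

    A point opens a new cluster when its squared distance to every
    existing centroid exceeds lam, so the number of clusters is not
    fixed in advance.
    """
    centroids = [mean(points)]  # start from one cluster at the global mean
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centroids]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] > lam:
                centroids.append(list(p))  # the point seeds a new cluster
                best = len(centroids) - 1
            if labels[i] != best:
                labels[i] = best
                changed = True
        # Recompute centroids and drop clusters that lost all members.
        members = {}
        for lbl, p in zip(labels, points):
            members.setdefault(lbl, []).append(p)
        kept = sorted(members)
        remap = {old: new for new, old in enumerate(kept)}
        centroids = [mean(members[k]) for k in kept]
        labels = [remap[lbl] for lbl in labels]
        if not changed:
            break
    return labels, centroids


# Toy stand-ins for passage embeddings (assumption: real inputs would be
# MPNet vectors for scraped article passages).
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 1.0],
]
points = [normalize(v) for v in toy_embeddings]
labels, centroids = dp_means(points, lam=0.3)
print(labels)
```

In this sketch each resulting cluster plays the role of one narrative: the first two vectors fall into one cluster, the next two into another, and the last vector forms its own. Because the threshold, rather than a preset cluster count, controls when a new cluster opens, previously unseen narratives can appear as new clusters as additional passages arrive.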