SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks

Diaspora communities are disproportionately impacted by off-the-radar misinformation and are often neglected by mainstream fact-checking efforts, creating a critical need to scale up the work of nascent fact-checking initiatives. In this paper we present SynDy, a framework for Synthetic Dynamic Dataset Generation that leverages the capabilities of the largest frontier Large Language Models (LLMs) to train local, specialized language models. To the best of our knowledge, SynDy is the first work to use LLMs to create fine-grained synthetic labels for tasks of direct relevance to misinformation mitigation, namely Claim Matching, Topical Clustering, and Claim Relationship Classification. SynDy uses LLMs and social media queries to automatically generate distantly supervised, topically focused datasets with synthetic labels for these three tasks, providing essential tools to scale up human-led fact-checking at a fraction of the cost of human-annotated data. Training on SynDy's generated labels improves over a standard baseline and is not significantly worse than training on human labels (which may be infeasible to acquire). SynDy is being integrated into Meedan's chatbot tiplines, which are used by over 50 organizations, serve over 230K users annually, and automatically distribute human-written fact-checks via messaging apps such as WhatsApp. SynDy will also be integrated into our deployed Co-Insights toolkit, enabling low-resource organizations to launch tiplines for their communities. Finally, we envision SynDy enabling additional fact-checking tools, such as matching new misinformation claims to high-quality explainers on common misinformation topics.
