MultiClaimNet: A Massively Multilingual Dataset of Fact-Checked Claim Clusters

In fact-checking, the same claims are often repeated across platforms and in different languages, so fact-checking pipelines can benefit from a process that reduces this redundancy. While retrieving previously fact-checked claims has been investigated as a solution, the growing number of unverified claims and the expanding size of fact-checked databases call for alternative, more efficient solutions. A promising solution is to group claims that discuss the same underlying facts into clusters to improve claim retrieval and validation. However, research on claim clustering is hindered by the lack of suitable datasets. To bridge this gap, we introduce \textit{MultiClaimNet}, a collection of three multilingual claim cluster datasets containing claims in 86 languages across diverse topics. Claim clusters are formed automatically from claim-matching pairs with limited manual intervention. We leverage two existing claim-matching datasets to form the two smaller datasets within \textit{MultiClaimNet}. To build the larger dataset, we propose and validate an approach that retrieves approximate nearest neighbors to form candidate claim pairs and then annotates claim similarity automatically using large language models. This larger dataset contains 85.3K fact-checked claims written in 78 languages. We further conduct extensive experiments using various clustering techniques and sentence embedding models to establish baseline performance. Our datasets and findings provide a strong foundation for scalable claim clustering, contributing to efficient fact-checking pipelines.
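As an illustration of how clusters might be derived from claim-matching pairs, the sketch below groups claims by taking connected components over the matched pairs with a small union-find structure. The abstract does not state the exact grouping rule, so connected components are an assumption here, and the function name `clusters_from_pairs` and the claim identifiers are hypothetical.

```python
from collections import defaultdict


def clusters_from_pairs(pairs):
    """Group claim IDs into clusters, assuming matched pairs are transitive.

    `pairs` is an iterable of (claim_a, claim_b) tuples judged to match.
    This is an illustrative assumption, not the paper's stated procedure.
    """
    parent = {}

    def find(x):
        # Register unseen claims as their own root, then walk to the root
        # while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = defaultdict(set)
    for claim in list(parent):
        groups[find(claim)].add(claim)
    return list(groups.values())


# Hypothetical matched pairs: c1-c2 and c2-c3 chain into one cluster.
clusters = clusters_from_pairs([("c1", "c2"), ("c2", "c3"), ("c4", "c5")])
```

Under this assumption a single incorrect pair can merge two unrelated clusters, which is one reason some manual checking of the resulting clusters can be useful.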
