Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims, a key step in the fact-checking process, remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and exhibit strong out-of-distribution generalization across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.