Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims, a key step in the fact-checking process, remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and exhibit strong out-of-distribution generalization across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.