ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles

Ricardo Campos, Raquel Sequeira, Sara Nerea, Inês Cantante, Diogo Folques, Luís Filipe Cunha, João Canavilhas, António Branco, Alípio Jorge, Sérgio Nunes, Nuno Guimarães, Purificação Silvano

Fact-checking remains a demanding and time-consuming task, still largely dependent on manual verification and unable to keep pace with the rapid spread of misinformation online. This gap matters because debunking typically takes longer to reach consumers than the misinformation itself; accelerating corrections through automation can therefore help counter misinformation more effectively. Although many organizations perform manual fact-checking, this approach is difficult to scale given the growing volume of digital content. These limitations have motivated interest in automating fact-checking, where identifying claims is a crucial first step. However, progress has been uneven across languages, with English dominating due to abundant annotated data. Portuguese, like many other languages, still lacks accessible, licensed datasets, which limits research, NLP development, and applications. In this paper, we introduce ClaimPT, a dataset of European Portuguese news articles annotated for factual claims, comprising 1,308 articles and 6,875 individual annotations. Unlike most existing resources, which are based on social media or parliamentary transcripts, ClaimPT focuses on journalistic content, collected through a partnership with LUSA, the Portuguese News Agency. To ensure annotation quality, two trained annotators labeled each article, and a curator validated all annotations according to a newly proposed scheme. We also provide baseline models for claim detection, establishing initial benchmarks and enabling future NLP and IR applications. By releasing ClaimPT, we aim to advance research on low-resource fact-checking and enhance understanding of misinformation in news media.
