The task of retrieving already debunked narratives aims to detect stories that have already been fact-checked. The successful detection of claims that have already been debunked not only reduces the manual efforts of professional fact-checkers but can also contribute to slowing the spread of misinformation. Mainly due to the lack of readily available data, this is an understudied problem, particularly when considering the cross-lingual task, i.e. the retrieval of fact-checking articles in a language different from the language of the online post being checked. This paper fills this gap by (i) creating a novel dataset to enable research on cross-lingual retrieval of already debunked narratives, using tweets as queries to a database of fact-checking articles; (ii) presenting an extensive experiment to benchmark fine-tuned and off-the-shelf multilingual pre-trained Transformer models for this task; and (iii) proposing a novel multistage framework that divides this cross-lingual debunk retrieval task into refinement and re-ranking stages. Results show that the task of cross-lingual retrieval of already debunked narratives is challenging and off-the-shelf Transformer models fail to outperform a strong lexical-based baseline (BM25). Nevertheless, our multistage retrieval framework is robust, outperforming BM25 in most scenarios and enabling cross-domain and zero-shot learning, without significantly harming the model's performance.
翻译:已辟谣叙事的检索任务旨在检测已被事实核查过的故事。成功检测已被辟谣的声明不仅能减少专业事实核查人员的劳动量,还可有助于减缓错误信息的传播。由于缺乏现成可用数据,这一问题研究不足,尤其在跨语言任务中——即检索与被核查在线帖子语言不同的事实核查文章。本文通过以下方式填补这一空白:(i) 创建一个新数据集,以推文作为事实核查文章数据库的查询语句,促进跨语言已辟谣叙事检索的研究;(ii) 开展广泛实验,对比针对此任务的微调与现成多语言预训练Transformer模型的性能;(iii) 提出一种新型多阶段框架,将跨语言辟谣检索任务划分为精炼与重排序阶段。结果表明,跨语言已辟谣叙事检索任务具有挑战性,现成Transformer模型无法超越基于强基线的词汇模型(BM25)。然而,我们的多阶段检索框架具有鲁棒性,在多数场景下优于BM25,且实现跨域与零样本学习而不会显著损害模型性能。