Mis- and disinformation are a substantial global threat to our security and safety. To cope with the scale of online misinformation, researchers have been working on automating fact-checking by retrieving relevant evidence and verifying claims against it. However, despite many advances, a comprehensive evaluation of the possible attack vectors against such systems is still lacking. In particular, the automated fact-verification process might be vulnerable to the very disinformation campaigns it is trying to combat. In this work, we assume an adversary that automatically tampers with the online evidence in order to disrupt the fact-checking model, either by camouflaging the relevant evidence or by planting misleading evidence. We first propose an exploratory taxonomy that spans these two targets and the different threat model dimensions. Guided by this taxonomy, we design several potential attack methods. We show that it is possible to subtly modify claim-salient snippets in the evidence and to generate diverse, claim-aligned evidence. As a result, fact-checking performance degrades substantially under many different permutations of the taxonomy's dimensions. The attacks are also robust against post-hoc modifications of the claim. Our analysis further hints at potential limitations in models' inference when they are faced with contradicting evidence. We emphasize that these attacks can have harmful implications for the inspectable and human-in-the-loop usage scenarios of such models, and we conclude by discussing challenges and directions for future defenses.