Fact-checking systems built on search-enabled large language models (LLMs) have shown strong potential for verifying claims by dynamically retrieving external evidence. However, the robustness of such systems against adversarial attacks remains insufficiently understood. In this work, we study adversarial claim attacks against search-enabled LLM-based fact-checking systems under a realistic input-only threat model. We propose DECEIVE-AFC, an agent-based adversarial attack framework that integrates novel claim-level attack strategies with adversarial claim validity evaluation principles. DECEIVE-AFC systematically explores adversarial attack trajectories that disrupt search behavior, evidence retrieval, and LLM-based reasoning, without requiring access to evidence sources or model internals. Extensive evaluations on benchmark datasets and real-world systems demonstrate that our attacks substantially degrade verification performance, reducing accuracy from 78.7% to 53.7%, and significantly outperform existing claim-based attack baselines while exhibiting strong cross-system transferability.