Misinformation and fake news have become a pressing societal challenge, driving the need for reliable automated detection methods. Prior research has highlighted sentiment as an important signal in fake news detection, either by analyzing which sentiments are associated with fake news or by using sentiment and emotion features for classification. However, this reliance creates a vulnerability, since adversaries can manipulate sentiment to evade detectors, especially with the advent of large language models (LLMs). A few studies have explored adversarial samples generated by LLMs, but they mainly focus on stylistic features such as the writing style of news publishers; the crucial vulnerability of sentiment manipulation thus remains largely unexplored. In this paper, we investigate the robustness of state-of-the-art fake news detectors under sentiment manipulation. We introduce AdSent, a sentiment-robust detection framework designed to ensure consistent veracity predictions across both original and sentiment-altered news articles. Specifically, we (1) propose controlled sentiment-based adversarial attacks using LLMs; (2) analyze the impact of sentiment shifts on detection performance, showing that altering sentiment severely degrades fake news detection models and reveals a bias toward classifying neutral articles as real and non-neutral articles as fake; and (3) introduce a novel sentiment-agnostic training strategy that enhances robustness against such perturbations. Extensive experiments on three benchmark datasets demonstrate that AdSent significantly outperforms competitive baselines in both accuracy and robustness, while also generalizing effectively to unseen datasets and adversarial scenarios.