As Large Language Models become more proficient, their misuse in coordinated disinformation campaigns is a growing concern. This study explores the capability of ChatGPT with GPT-3.5 to generate short-form disinformation claims about the war in Ukraine, both in general and about a specific event that lies beyond GPT-3.5's knowledge cutoff. Unlike prior work, we do not provide the model with human-written disinformation narratives in the prompt; the generated short claims are therefore hallucinations built from prior world knowledge and inference from a minimal prompt. With a straightforward prompting technique, we are able to bypass model safeguards and generate numerous short claims. We compare these against human-authored false claims about the war in Ukraine from ClaimReview, focusing on differences in their linguistic properties. We also evaluate whether AI authorship can be distinguished by human readers or by state-of-the-art authorship detection tools. We demonstrate that ChatGPT can produce realistic, target-specific disinformation claims, even about a specific post-cutoff event, and that these claims cannot be reliably distinguished by humans or existing automated tools.