This research examines step-around prompt engineering, an emerging technique in GenAI research that deliberately bypasses AI safety measures to expose underlying biases and vulnerabilities in generative models. We discuss how Internet-sourced training data introduces unintended biases and misinformation into AI systems, and how the careful application of step-around techniques can reveal these flaws. Drawing parallels with red teaming in cybersecurity, we argue that step-around prompting plays a vital role in identifying and addressing potential vulnerabilities, while acknowledging its dual nature as both a research tool and a potential security threat. Our findings highlight three key implications: (1) the persistence of Internet-derived biases in AI training data despite content filtering, (2) the effectiveness of step-around techniques in exposing these biases when used responsibly, and (3) the need for robust safeguards against malicious applications of these methods. We conclude by proposing an ethical framework for the use of step-around prompting in AI research and development, emphasizing the importance of balancing system improvement with security considerations.