This research examines step-around prompt engineering, an emerging technique in GenAI research that deliberately bypasses AI safety measures to expose underlying biases and vulnerabilities in generative models. We discuss how Internet-sourced training data introduces unintended biases and misinformation into AI systems, and how the careful application of step-around techniques can reveal these flaws. Drawing parallels with red teaming in cybersecurity, we argue that step-around prompting plays a vital role in identifying and addressing potential vulnerabilities, while acknowledging its dual nature as both a research tool and a potential security threat. Our findings highlight three key implications: (1) the persistence of Internet-derived biases in AI training data despite content filtering, (2) the effectiveness of step-around techniques in exposing these biases when used responsibly, and (3) the need for robust safeguards against malicious applications of these methods. We conclude by proposing an ethical framework for the use of step-around prompting in AI research and development, emphasizing the importance of balancing system improvement with security considerations.