Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity

Recent breakthroughs in natural language processing (NLP) have made it possible to synthesize and comprehend coherent text in an open-ended way, translating theoretical algorithms into practical applications. Large language models (LLMs) have significantly affected businesses such as report summarization software and copywriting. Observations indicate, however, that LLMs may exhibit social prejudice and toxicity, posing ethical and societal dangers when they are deployed irresponsibly. Large-scale benchmarks for accountable LLMs should consequently be developed. Although several empirical investigations reveal ethical difficulties in advanced LLMs, there is little systematic examination or user study of the risks and harmful behaviors of current LLM usage. To better inform future efforts to build ethical LLMs responsibly, we apply a qualitative research method called "red teaming" to OpenAI's ChatGPT (in this paper, ChatGPT refers to the version released on Dec 15th) to better understand the practical features of ethical dangers in recent LLMs. We analyze ChatGPT comprehensively from four perspectives: 1) Bias, 2) Reliability, 3) Robustness, and 4) Toxicity. In accordance with these perspectives, we empirically benchmark ChatGPT on multiple sample datasets. We find that a significant number of ethical risks cannot be addressed by existing benchmarks, and hence illustrate them via additional case studies. In addition, we examine the implications of our findings for AI ethics and the harmful behaviors of ChatGPT, as well as future problems and practical design considerations for responsible LLMs. We believe that our findings may shed light on future efforts to determine and mitigate the ethical hazards posed by machines in LLM applications.


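Related content

ChatGPT

ChatGPT (full name: Chat Generative Pre-trained Transformer) is a chatbot developed by the US company OpenAI [1] and released on November 30, 2022. It is an AI-driven natural language processing tool that converses by learning and understanding human language, responds in light of the conversational context, and chats much as a person would. It can also draft emails, video scripts, marketing copy, translations, code, and papers. [1] https://openai.com/blog/chatgpt/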
