Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to the novel work of attempting to cause LLMs to fail. We focused on the research questions of defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity, which depends highly on team effort and an alchemist mindset. It is strongly intrinsically motivated by curiosity and fun, and to some degree by concerns about the various harms of deploying LLMs. We identify a taxonomy of 12 strategies and 35 different techniques for attacking LLMs. These findings are presented as a comprehensive grounded theory of how and why people attack large language models: LLM red teaming.