Large Language Models (LLMs) present a dual-use dilemma: they enable beneficial applications while harboring potential for harm, particularly through conversational interactions. Despite various safeguards, advanced LLMs remain vulnerable. A watershed case was Kevin Roose's notable conversation with Bing, which elicited harmful outputs after extended interaction. This contrasts with simpler early jailbreaks that produced similar content more easily, raising the question: How much conversational effort is needed to elicit harmful information from LLMs? We propose two measures: Conversational Length (CL), which quantifies the conversation length used to obtain a specific response, and Conversational Complexity (CC), defined as the Kolmogorov complexity of the user's instruction sequence leading to the response. To address the incomputability of Kolmogorov complexity, we approximate CC using a reference LLM to estimate the compressibility of user instructions. Applying this approach to a large red-teaming dataset, we perform a quantitative analysis examining the statistical distribution of harmful and harmless conversational lengths and complexities. Our empirical findings suggest that this distributional analysis and the minimisation of CC serve as valuable tools for understanding AI safety, offering insights into the accessibility of harmful information. This work establishes a foundation for a new perspective on LLM safety, centered around the algorithmic complexity of pathways to harm.
翻译:大型语言模型(LLMs)呈现出双重用途困境:它们既能实现有益应用,又潜藏着通过对话交互造成危害的可能性。尽管存在多种安全防护措施,先进的LLMs仍然存在脆弱性。一个标志性案例是凯文·鲁斯与Bing的著名对话,该对话在长时间交互后引发了有害输出。这与早期能更轻易产生类似内容的简单越狱攻击形成对比,从而引出一个问题:需要多少对话努力才能从LLMs中获取有害信息?我们提出两个衡量指标:对话长度(CL)——量化获取特定响应所需的对话长度,以及对话复杂度(CC)——定义为导致该响应的用户指令序列的柯尔莫哥洛夫复杂度。针对柯尔莫哥洛夫复杂度的不可计算性,我们采用参考LLM来估计用户指令的可压缩性,从而近似计算CC。将此方法应用于大规模红队测试数据集,我们通过分析有害与无害对话长度及复杂度的统计分布进行定量研究。实证结果表明,这种分布分析及CC最小化可作为理解AI安全性的重要工具,为有害信息的可获取性提供见解。本研究为以危害路径的算法复杂度为核心的新LLM安全视角奠定了基础。