A novel hack against Large Language Models (LLMs) has emerged that uses adversarial suffixes to trick models into generating dangerous responses. The method has drawn considerable attention from media outlets such as the New York Times and Wired, and has thereby shaped public perception of the security and safety of LLMs. In this study, we propose perplexity as one means of recognizing such attacks. The hack works by appending an unusually constructed string of text to a harmful query that would otherwise be blocked. The appended string confuses the model's protective mechanisms and tricks it into producing a forbidden response, which could give a malicious user detailed instructions for, say, constructing explosives or orchestrating a bank heist. Our investigation demonstrates that perplexity, a widely used natural language processing metric, can detect these adversarial queries before a forbidden response is generated. Evaluating the perplexity of queries with and without adversarial suffixes using an open-source LLM, we found that nearly 90 percent of the queries carrying an adversarial suffix had a perplexity above 1000. This contrast underscores the efficacy of perplexity for detecting this type of exploit.
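As a rough illustration of the detection idea described in the abstract, the sketch below computes perplexity as the exponential of the mean negative log-likelihood per token and flags a query whose perplexity exceeds a cutoff. It is a minimal sketch under stated assumptions, not the study's implementation: the function names `perplexity` and `flag_query` are hypothetical, the per-token log-probabilities are assumed to come from whatever open-source LLM scores the query, the example values are invented for illustration, and the cutoff of 1000 simply mirrors the figure quoted above rather than a tuned threshold.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities.

    Defined as exp(-(1/N) * sum(log p(token_i | preceding tokens))).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def flag_query(token_logprobs: Sequence[float], threshold: float = 1000.0) -> bool:
    """Return True when the query's perplexity exceeds the threshold.

    The default threshold is an assumption taken from the 1000 figure
    in the text, not a calibrated value.
    """
    return perplexity(token_logprobs) > threshold


# Invented log-probabilities, for illustration only.
fluent_query = [-2.0, -3.5, -1.2, -4.0]        # tokens the model finds plausible
suffixed_query = [-8.0, -9.5, -7.2, -10.1]     # tokens the model finds very unlikely

print(flag_query(fluent_query))    # low perplexity, not flagged
print(flag_query(suffixed_query))  # high perplexity, flagged
```

Because the check needs only the scoring model's token log-probabilities for the incoming query, it can run before any response is generated, which is the property the abstract relies on.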