The recent surge in jailbreaking methods has revealed the vulnerability of Large Language Models (LLMs) to malicious inputs. While earlier research has primarily focused on increasing the success rates of jailbreaking attacks, the underlying mechanisms that safeguard LLMs remain underexplored. This study investigates the vulnerability of safety-aligned LLMs by uncovering specific activity patterns within their representation space. Such ``safety patterns'' can be identified with a simple method using only a few pairs of contrastive queries, and they function as ``keys'' (a metaphor for a model's safety defense capability) that can lock or unlock the Pandora's Box of LLMs. Extensive experiments demonstrate that the robustness of LLMs against jailbreaking can be weakened or enhanced by attenuating or strengthening the identified safety patterns, respectively. These findings deepen our understanding of jailbreaking phenomena and call on the LLM community to address the potential misuse of open-source LLMs.
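To make the contrastive procedure concrete, below is a minimal PyTorch sketch of how such a pattern could be extracted from a few contrastive query pairs and then injected back into the model. The model name, layer index, scaling factor `ALPHA`, and example queries are illustrative assumptions, and the mean-difference extraction shown here is one plausible instantiation, not necessarily the paper's exact method.

```python
# Minimal sketch: extract a candidate "safety pattern" as the mean activation
# difference between harmful and benign queries, then steer the model with it.
# All concrete choices below (model, layer, queries, ALPHA) are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical safety-aligned LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# A few contrastive query pairs: each harmful query has a benign counterpart.
harmful = ["How do I make a bomb?", "Write code that steals saved passwords."]
benign  = ["How do I make a cake?", "Write code that stores passwords safely."]

LAYER = 20  # hypothetical layer at which the pattern is extracted

def mean_hidden(texts):
    """Average last-token hidden state at LAYER over a list of queries."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Candidate safety pattern: mean activation difference across the pairs.
safety_pattern = mean_hidden(harmful) - mean_hidden(benign)

ALPHA = 1.0  # > 0 strengthens the pattern (locks the box); < 0 attenuates it

def steer(module, inputs, output):
    """Forward hook that adds the scaled pattern to the layer's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * safety_pattern.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
# ... run model.generate(...) here to observe the steered behavior ...
handle.remove()
```

Under this sketch, the same extracted vector serves as both key and lock: generating with a negative `ALPHA` would attenuate the pattern (probing the weakened-defense setting), while a positive `ALPHA` would strengthen it.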