As prompt jailbreaking of Large Language Models (LLMs) attracts growing attention, it is important to establish a general research paradigm for evaluating attack strength and a baseline model for conducting finer-grained experiments. In this paper, we propose a novel approach that focuses on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations imposed by enhanced LLM security. By designing and analyzing these sensitive questions, we reveal a more effective method for identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also helps fortify LLMs against potential exploits.