Large language models (LLMs) have demonstrated remarkable capabilities across diverse applications; however, they remain critically vulnerable to jailbreak attacks that elicit harmful responses violating human values and safety guidelines. Despite extensive research on defense mechanisms, existing safeguards remain insufficient against sophisticated adversarial strategies. In this work, we propose iMIST (\underline{i}nteractive \underline{M}ult\underline{i}-step progre\underline{s}sive \underline{T}ool-disguised Jailbreak Attack), a novel adaptive jailbreak method that exploits complementary vulnerabilities in current defense mechanisms. iMIST disguises malicious queries as benign tool invocations to bypass content filters, and couples this disguise with an interactive progressive optimization algorithm that dynamically escalates response harmfulness across multi-turn dialogues, guided by real-time harmfulness assessment. Our experiments on widely-used models demonstrate that iMIST achieves higher attack effectiveness than existing methods while maintaining low rejection rates. These results reveal critical vulnerabilities in current LLM safety mechanisms and underscore the urgent need for more robust defense strategies.