Large Language Models (LLMs) are pivotal components of today's AI-dominated information technology ecosystem. To mitigate the risk of harmful or policy-violating outputs, commercial systems employ advanced alignment strategies and multi-layered content moderation mechanisms. Despite these safeguards, recent research has demonstrated that LLMs remain vulnerable to adversarial manipulation, particularly through jailbreaking and prompt injection techniques. In this work, we propose GAS-Leak-LLM, a novel jailbreaking attack based on a genetic algorithm that systematically evolves adversarial suffixes to bypass safety constraints. Operating in a strict black-box setting, our method requires no access to model parameters or internals, thereby reflecting realistic threat scenarios in deployed systems. By iteratively applying selection, mutation, and crossover heuristics, the framework explores the discrete prompt space to identify high-fitness adversarial suffixes. Our empirical findings reveal critical shortcomings in existing safety enforcement mechanisms and confirm the effectiveness and practical viability of the proposed attack.