Large Language Models (LLMs) are now widely deployed, especially through free Web-based applications that expose them to diverse user-generated inputs, including inputs from long-tail distributions such as low-resource languages and encrypted private data. This open-ended exposure increases the risk of jailbreak attacks that undermine model safety alignment. Recent studies have shown that long-tail distributions can facilitate such jailbreaks, but existing approaches rely largely on handcrafted rules, which limits systematic evaluation of these security and privacy vulnerabilities. In this work, we present EvoJail, an automated framework for discovering long-tail distribution attacks via multi-objective evolutionary search. EvoJail formulates long-tail attack prompt generation as a multi-objective optimization problem that jointly maximizes attack effectiveness and minimizes output perplexity, and it introduces a semantic-algorithmic solution representation that captures both the high-level semantic intent and the low-level structural transformations of the encryption-decryption logic. Building on this representation, EvoJail integrates LLM-assisted operators into a multi-objective evolutionary framework, enabling adaptive, semantically informed mutation and crossover that efficiently explore a highly structured, open-ended search space. Extensive experiments demonstrate that EvoJail consistently discovers diverse and effective long-tail jailbreak strategies and achieves performance competitive with existing methods at both the individual and ensemble levels.
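The abstract does not say how candidates are compared under the two objectives. As a minimal sketch, assuming a standard Pareto-dominance comparison (a common choice in multi-objective evolutionary search, not confirmed as EvoJail's own selection rule), the trade-off between effectiveness (maximized) and perplexity (minimized) could be filtered as follows. The names `Candidate`, `dominates`, and `pareto_front` are hypothetical, and the scores are made-up illustrative values.

```python
from typing import List, Tuple

# Hypothetical candidate record: (label, effectiveness, perplexity).
# Effectiveness is to be maximized; perplexity is to be minimized.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives
    and strictly better on at least one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(population: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in population
        if not any(dominates(other, c) for other in population)
    ]


# Illustrative values only; these are not results from the paper.
population = [
    ("c1", 0.90, 40.0),
    ("c2", 0.70, 12.0),
    ("c3", 0.60, 30.0),  # dominated by c2 (lower score, higher perplexity)
    ("c4", 0.90, 55.0),  # dominated by c1 (same score, higher perplexity)
]
front = pareto_front(population)
print([label for label, _, _ in front])
```

In this sketch neither `c1` nor `c2` dominates the other, so both remain on the front. That is the sense in which a multi-objective search returns a set of trade-offs instead of a single best candidate.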