Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet many still formulate attack optimization primarily in prompt space: they search for better attack wording or better strategy choices, but not over executable code. Moving the search into code space makes it possible to optimize not only the final attack prompt but also the procedure that generates it, including execution flow, reusable logic, branching, and failure-driven repair. To close this gap, we introduce EvoSynth, an autonomous multi-agent framework that shifts the optimization space from prompts to executable code. Instead of refining prompts directly, EvoSynth uses a multi-agent system to autonomously engineer, evolve, and execute code-based attack algorithms. Crucially, it includes a code-level self-correction loop that iteratively rewrites the code-based algorithm in response to target-model feedback and failed attempts. In extensive experiments, EvoSynth achieves an 85.5\% Attack Success Rate (ASR) against highly robust models such as Claude-Sonnet-4.5 and a 95.9\% average ASR across the evaluated targets, while generating attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research on evolutionary synthesis in executable code space.
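The distinction between searching over outputs and searching over the procedure that produces them can be illustrated with a deliberately generic toy. The sketch below is not EvoSynth's implementation: the step library, the `stub_evaluator`, and the `repair` rule are all hypothetical stand-ins, and the task (text normalization) is chosen only because it is easy to verify. It shows one pattern the abstract names, an evaluate-and-repair loop in which the candidate is a small executable procedure that is rewritten in response to evaluator feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical library of reusable steps; a candidate "program" is a sequence of them.
STEP_LIBRARY: Dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "dedupe_spaces": lambda s: " ".join(s.split()),
    "add_period": lambda s: s if s.endswith(".") else s + ".",
}

@dataclass
class Candidate:
    """A tiny executable procedure: an ordered list of step names."""
    steps: List[str] = field(default_factory=list)

def run(candidate: Candidate, seed: str) -> str:
    """Execute the candidate procedure on a seed input."""
    out = seed
    for name in candidate.steps:
        out = STEP_LIBRARY[name](out)
    return out

def stub_evaluator(output: str) -> Tuple[bool, str]:
    """Stand-in for an external evaluator (assumption, not a real API).

    Returns (success, feedback), where feedback names the first failed check.
    """
    if output != output.strip():
        return False, "strip"
    if "  " in output:
        return False, "dedupe_spaces"
    if not output.endswith("."):
        return False, "add_period"
    return True, ""

def repair(candidate: Candidate, feedback: str) -> Candidate:
    """Failure-driven repair: rewrite the procedure by appending the indicated step."""
    return Candidate(steps=candidate.steps + [feedback])

def evolve(seed: str, max_iters: int = 5):
    """Iteratively run, evaluate, and repair the procedure (not its output)."""
    candidate = Candidate()
    history = []  # (steps at this iteration, success flag)
    for _ in range(max_iters):
        output = run(candidate, seed)
        ok, feedback = stub_evaluator(output)
        history.append((list(candidate.steps), ok))
        if ok:
            return candidate, output, history
        candidate = repair(candidate, feedback)
    return candidate, run(candidate, seed), history

if __name__ == "__main__":
    final, text, log = evolve("  hello   world ")
    print(final.steps, repr(text), len(log))
```

The point of the toy is that the artifact being improved is `Candidate.steps`, which can be reused on new seeds, rather than a single output string. A real system would replace the stub evaluator and the one-line repair rule with far richer components.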