Despite extensive safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. Existing attack methods, however, generally cannot learn continuously or evolve from their own interactions, which limits the diversity and adaptability of their attack strategies. To address this, we propose ASTRA, an automated framework that autonomously discovers, retrieves, and evolves attack strategies. ASTRA runs a closed-loop ``attack-evaluate-distill-reuse'' mechanism that not only generates attack prompts but also automatically distills reusable strategies from every interaction. To manage these strategies systematically, we introduce a dynamic three-tier strategy library (Effective, Promising, and Ineffective) that categorizes strategies by performance. This hierarchical memory improves efficiency by reusing successful patterns and narrows the exploration space by avoiding known failures. Extensive experiments in a black-box setting demonstrate that ASTRA significantly outperforms existing baselines.
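
The three-tier strategy library described above can be pictured as a small bookkeeping structure. The following is a minimal sketch under stated assumptions, not ASTRA's actual implementation: the class name `StrategyLibrary`, the 0-to-1 evaluator score, the mean-score rule, and both thresholds are hypothetical, since the abstract only says that strategies are categorized by performance.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    EFFECTIVE = "Effective"
    PROMISING = "Promising"
    INEFFECTIVE = "Ineffective"


@dataclass
class StrategyLibrary:
    # Hypothetical thresholds on a 0-1 evaluator score; the source gives none.
    effective_min: float = 0.8
    promising_min: float = 0.4
    # Maps a strategy label to every score it has received so far.
    scores: dict[str, list[float]] = field(default_factory=dict)

    def record(self, strategy: str, score: float) -> None:
        """Store the evaluator score from one interaction."""
        self.scores.setdefault(strategy, []).append(score)

    def tier(self, strategy: str) -> Tier:
        """Recompute the tier from the mean score, so tiers shift as evidence arrives."""
        history = self.scores[strategy]
        mean = sum(history) / len(history)
        if mean >= self.effective_min:
            return Tier.EFFECTIVE
        if mean >= self.promising_min:
            return Tier.PROMISING
        return Tier.INEFFECTIVE

    def by_tier(self, tier: Tier) -> list[str]:
        """List the strategies currently assigned to the given tier."""
        return [name for name in self.scores if self.tier(name) is tier]


# Placeholder strategy labels and scores, chosen only for illustration.
library = StrategyLibrary()
library.record("strategy_a", 0.9)
library.record("strategy_b", 0.5)
library.record("strategy_c", 0.1)
library.record("strategy_c", 0.2)
```

Because the tier is recomputed on every lookup rather than stored, a strategy moves between tiers as new scores are recorded, which is one simple way to read "dynamic" in the description above.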