Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern because they are susceptible to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extracts minimal but high-impact attack features from diverse existing attacks, optimizes them with a specialized attacker LLM, and composes them into flexible templates through an automated refinement process. This feature-centric construction enables one-shot attacks that generalize across multiple models and safety categories, providing a practical tool for assessing LLM robustness. Our evaluation shows that, compared to the baselines, UNIATTACK improves the average attack success rate (ASR) by 64.63\%-248.82\% on models deployed with multi-layered defense mechanisms, at only 0.03\%-4.96\% of the baselines' cost. The UNIATTACK artifact is available at https://anonymous.4open.science/r/UniAttack-Artifact-30F1.