The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges for systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual and expert-driven, which limits scalability, reproducibility, and coverage in high-dimensional prompt spaces. We formulate automated LLM red-teaming as a structured adversarial search problem and propose a learning-driven framework for scalable vulnerability discovery. The approach combines meta-prompt-guided adversarial prompt generation with a hierarchical execution and detection pipeline, enabling standardized evaluation across six representative threat categories: reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Extensive experiments on GPT-OSS-20B identify 47 vulnerabilities, including 21 high-severity failures and 12 previously undocumented attack patterns. Under matched query budgets, our method achieves a 3.9$\times$ higher discovery rate than manual red-teaming with 89\% detection accuracy, demonstrating superior coverage, efficiency, and reproducibility for large-scale robustness evaluation.