PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses

Prompt injection poses serious security risks to real-world LLM applications, particularly autonomous agents. Although many defenses have been proposed, their robustness against adaptive attacks remains insufficiently evaluated, which can create a false sense of security. In this work, we propose PISmith, a reinforcement learning (RL)-based red-teaming framework that systematically assesses existing prompt-injection defenses by training an attack LLM to optimize injected prompts in a practical black-box setting, where the attacker can only query the defended LLM and observe its outputs. We find that directly applying standard GRPO to attack strong defenses leads to suboptimal performance because of extreme reward sparsity: most generated injected prompts are blocked by the defense, so the policy's entropy collapses before effective attack strategies are discovered, and the rare successes that do occur cannot be learned from effectively. In response, we introduce adaptive entropy regularization to sustain exploration and dynamic advantage weighting to amplify learning from scarce successes. Extensive evaluation on 13 benchmarks demonstrates that state-of-the-art prompt injection defenses remain vulnerable to adaptive attacks. We also compare PISmith with 7 baselines across static, search-based, and RL-based attack categories, showing that PISmith consistently achieves the highest attack success rates. Furthermore, PISmith achieves strong performance in agentic settings on InjecAgent and AgentDojo against both open-source and closed-source LLMs (e.g., GPT-4o-mini and GPT-5-nano). Our code is available at https://github.com/albert-y1n/PISmith.
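
To make the reward-sparsity argument concrete, the following is a minimal Python sketch, not the PISmith implementation. Only `group_relative_advantages` reflects the standard GRPO group normalization. The functions `weight_scarce_successes` and `adapt_entropy_coef`, along with all of their parameters and default values, are hypothetical illustrations of what "dynamic advantage weighting" and "adaptive entropy regularization" could look like; the abstract does not specify the actual formulas.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def weight_scarce_successes(
    advantages: List[float], rewards: List[float], max_weight: float = 4.0
) -> List[float]:
    """HYPOTHETICAL weighting rule (an assumption, not the paper's formula):
    scale up positive advantages more strongly when successes are rarer."""
    success_rate = sum(1 for r in rewards if r > 0) / len(rewards)
    if success_rate == 0:
        return list(advantages)  # nothing to amplify in this group
    weight = min(max_weight, 1.0 / success_rate)
    return [a * weight if a > 0 else a for a in advantages]


def adapt_entropy_coef(
    coef: float,
    entropy: float,
    target_entropy: float,
    step: float = 0.01,
    max_coef: float = 0.1,
) -> float:
    """HYPOTHETICAL controller (an assumption): raise the entropy-bonus
    coefficient when policy entropy falls below a target, lower it otherwise."""
    new_coef = coef + step * (target_entropy - entropy)
    return min(max_coef, max(0.0, new_coef))


if __name__ == "__main__":
    # Every injected prompt in the group was blocked: no learning signal at all.
    blocked = [0.0, 0.0, 0.0, 0.0]
    print(group_relative_advantages(blocked))

    # One success among four attempts: only this group carries a gradient.
    mixed = [0.0, 0.0, 0.0, 1.0]
    adv = group_relative_advantages(mixed)
    print(weight_scarce_successes(adv, mixed))

    # Entropy has dropped below the target, so the coefficient is increased.
    print(adapt_entropy_coef(coef=0.01, entropy=0.2, target_entropy=1.0))
```

In the all-blocked group every advantage is zero, so the policy gradient from that group vanishes; this is the sparsity problem the abstract describes. The two hypothetical helpers show one plausible way to counteract it, under the assumptions stated above.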
