Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods depend heavily on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks against unseen models and tasks. Using only a 1.5B-parameter adversarial suffix generator, we successfully compromise frontier systems including GPT-5 Nano, Claude 3.5 Sonnet, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
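The joint objective mentioned above can be sketched as a scalar reward that trades off attack success against preserved benign utility. This is a minimal illustration, not AutoInject's actual reward: the weighting `lam` and both score definitions are assumptions for exposition.

```python
def joint_reward(attack_success: float, benign_utility: float,
                 lam: float = 0.5) -> float:
    """Hypothetical joint reward for an adversarial suffix.

    attack_success: fraction of injected tasks the agent executed (0-1).
    benign_utility: fraction of benign tasks still solved correctly (0-1).
    lam: trade-off weight between the two terms (assumed value).
    """
    # Reward suffixes that succeed at injection while leaving
    # benign-task performance intact.
    return (1 - lam) * attack_success + lam * benign_utility

# A suffix that always triggers the injection but destroys utility
# scores lower than one that balances both objectives:
print(joint_reward(1.0, 0.0))  # 0.5
print(joint_reward(0.9, 0.8))  # 0.85
```

Under this kind of shaping, a policy-gradient method can optimize the suffix generator without gradient access to the target model, which is consistent with the black-box setting described in the abstract.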