Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Prompt injection attacks, where untrusted data contains an injected prompt to manipulate the system, have been listed as the top security threat to LLM-integrated applications. Model-level prompt injection defenses have shown strong effectiveness, but the strongest defenses are proprietary. Open-source secure models are needed by the AI security community so that co-development of attacks and defenses through open research can drive scientific progress in mitigating prompt injection attacks. To this end, we develop Meta SecAlign, the first fully open-source LLM with built-in model-level defense that achieves commercial-grade performance and is powerful enough for complex agentic tasks. We provide complete details of our training recipe. We perform the most comprehensive evaluation to date on 9 utility benchmarks (measuring general knowledge, instruction following, and agentic workflows) and 7 security benchmarks. Results show that Meta SecAlign, despite being trained only on generic instruction-tuning samples, surprisingly confers security in unseen downstream tasks, including tool-calling and web-navigation, in addition to general instruction-following. Our best model -- Meta-SecAlign-70B -- establishes a new frontier of utility-security trade-off for open-source LLMs, and is more secure than several flagship proprietary models with prompt injection defense. Below are links for the code (https://github.com/facebookresearch/Meta_SecAlign), Meta-SecAlign-70B (https://huggingface.co/facebook/Meta-SecAlign-70B), and Meta-SecAlign-8B (https://huggingface.co/facebook/Meta-SecAlign-8B) models.

翻译：提示注入攻击，即不受信任的数据包含被注入的提示以操纵系统，已被列为LLM集成应用的首要安全威胁。模型级提示注入防御已显示出强大的有效性，但最强的防御方案是专有的。AI安全社区需要开源的安全模型，以便通过开放研究共同开发攻击和防御，从而推动缓解提示注入攻击的科学研究进展。为此，我们开发了Meta SecAlign，这是首个完全开源、内置模型级防御的大语言模型，其性能达到商业级水平，且足够强大以处理复杂的智能体任务。我们提供了完整的训练方案细节。我们进行了迄今为止最全面的评估，涵盖9个效用基准（衡量通用知识、指令遵循和智能体工作流）和7个安全基准。结果表明，尽管Meta SecAlign仅使用通用指令微调样本进行训练，却出人意料地在未见的下游任务（包括工具调用和网络导航）以及通用指令遵循中赋予了安全性。我们的最佳模型——Meta-SecAlign-70B——为开源大语言模型在效用与安全权衡方面树立了新的前沿，其安全性甚至超过多个具备提示注入防御功能的旗舰专有模型。以下是代码（https://github.com/facebookresearch/Meta_SecAlign）、Meta-SecAlign-70B（https://huggingface.co/facebook/Meta-SecAlign-70B）和Meta-SecAlign-8B（https://huggingface.co/facebook/Meta-SecAlign-8B）模型的链接。