This paper presents Sentra-Guard, a real-time modular defense system that detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The framework uses a hybrid architecture that combines FAISS-indexed SBERT embeddings, which capture the semantic meaning of prompts, with fine-tuned transformer classifiers that distinguish benign from adversarial inputs. It identifies adversarial prompts in both direct and obfuscated attack vectors. A core innovation is the classifier-retriever fusion module, which dynamically computes a context-aware risk score estimating how likely a prompt is to be adversarial given its content and context. The framework achieves multilingual resilience through a language-agnostic preprocessing layer that automatically translates non-English prompts into English for semantic evaluation, enabling consistent detection across more than 100 languages. The system includes a human-in-the-loop (HITL) feedback loop, in which human experts review the automated system's decisions to support continual learning and rapid adaptation under adversarial pressure. Sentra-Guard maintains an evolving dual-labeled knowledge base of benign and malicious prompts, which improves detection reliability and reduces false positives. Evaluation results show a 99.96% detection rate (AUC = 1.00, F1 = 1.00) and an attack success rate (ASR) of only 0.004%, outperforming leading baselines such as LlamaGuard-2 (1.3%) and OpenAI Moderation (3.7%). Unlike black-box approaches, Sentra-Guard is transparent, fine-tunable, and compatible with diverse LLM backends. Its modular design supports scalable deployment in both commercial and open-source environments. The system establishes a new state of the art in adversarial LLM defense.
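To make the classifier-retriever fusion idea concrete, the following is a minimal sketch, not the implementation described in the paper. It assumes the fusion is a weighted average of a classifier probability and a retrieval score, where the retrieval score is the similarity-weighted share of malicious labels among the nearest entries of the dual-labeled knowledge base. The function names, the weight `alpha`, the neighbor count `k`, and the two-dimensional toy vectors are all hypothetical. A real system would use SBERT embeddings and a FAISS index in place of the brute-force cosine search shown here.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_risk(query: Vector, kb: List[Tuple[Vector, int]], k: int = 3) -> float:
    """Similarity-weighted share of malicious labels among the k nearest entries.

    Each knowledge-base entry is (embedding, label), with label 1 = malicious
    and label 0 = benign. Brute-force search stands in for a FAISS index.
    """
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in kb), reverse=True
    )[:k]
    weights = [max(sim, 0.0) for sim, _ in scored]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * label for w, (_, label) in zip(weights, scored)) / total


def fused_risk(classifier_prob: float, retrieved: float, alpha: float = 0.6) -> float:
    """Hypothetical fusion rule: weighted average of the two signals."""
    return alpha * classifier_prob + (1.0 - alpha) * retrieved


# Toy dual-labeled knowledge base (assumed data, for illustration only).
KB = [
    ([1.0, 0.0], 1),  # malicious
    ([0.9, 0.1], 1),  # malicious
    ([0.0, 1.0], 0),  # benign
    ([0.1, 0.9], 0),  # benign
]

if __name__ == "__main__":
    suspicious = retrieval_risk([1.0, 0.05], KB)
    harmless = retrieval_risk([0.05, 1.0], KB)
    print(fused_risk(0.8, suspicious), fused_risk(0.1, harmless))
```

In this sketch a prompt would be flagged when the fused score exceeds a chosen threshold. Prompts reviewed in the HITL loop could then be appended to `KB` with their confirmed labels, which shifts later retrieval scores without retraining the classifier.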