Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts

This paper presents Sentra-Guard, a real-time modular defense system that detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The framework uses a hybrid architecture that combines FAISS-indexed SBERT embeddings, which capture the semantic meaning of prompts, with fine-tuned transformer classifiers specialized for distinguishing benign from adversarial inputs. It identifies adversarial prompts in both direct and obfuscated attack vectors. A core innovation is the classifier-retriever fusion module, which dynamically computes a context-aware risk score estimating how likely a prompt is to be adversarial given its content and context. A language-agnostic preprocessing layer provides multilingual resilience: it automatically translates non-English prompts into English for semantic evaluation, enabling consistent detection across more than 100 languages. The system includes a human-in-the-loop (HITL) feedback loop, in which human experts review the automated system's decisions to support continual learning and rapid adaptation under adversarial pressure. Sentra-Guard maintains an evolving dual-labeled knowledge base of benign and malicious prompts, which improves detection reliability and reduces false positives. Evaluation results show a 99.96% detection rate (AUC = 1.00, F1 = 1.00) and an attack success rate (ASR) of only 0.004%, outperforming leading baselines such as LlamaGuard-2 (1.3% ASR) and OpenAI Moderation (3.7% ASR). Unlike black-box approaches, Sentra-Guard is transparent, fine-tunable, and compatible with diverse LLM backends. Its modular design supports scalable deployment in both commercial and open-source environments. The system establishes a new state of the art in adversarial LLM defense.
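To make the classifier-retriever fusion idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the fusion rule, so everything here is an assumption chosen for illustration: the linear weighting `alpha`, the similarity-weighted neighbor vote, the decision `threshold`, and the names `KBEntry`, `retrieval_score`, `fused_risk`, and `decide`. Plain Python lists stand in for SBERT embeddings, and a brute-force cosine search stands in for the FAISS index.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class KBEntry:
    """One prompt in a dual-labeled knowledge base (hypothetical structure)."""
    embedding: List[float]
    is_malicious: bool


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieval_score(query: List[float], kb: List[KBEntry], k: int = 3) -> float:
    """Similarity-weighted share of malicious entries among the k nearest neighbors.

    Brute-force search is used here; a real deployment would query an index.
    """
    ranked = sorted(kb, key=lambda e: cosine(query, e.embedding), reverse=True)[:k]
    weights = [max(cosine(query, e.embedding), 0.0) for e in ranked]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w for w, e in zip(weights, ranked) if e.is_malicious) / total


def fused_risk(classifier_prob: float, retrieval: float, alpha: float = 0.6) -> float:
    """Assumed fusion rule: convex combination of classifier and retriever signals."""
    return alpha * classifier_prob + (1 - alpha) * retrieval


def decide(risk: float, threshold: float = 0.5) -> str:
    """Map a fused risk score to an action using an assumed threshold."""
    return "block" if risk >= threshold else "allow"


if __name__ == "__main__":
    kb = [
        KBEntry([1.0, 0.0], True),    # toy embedding of a known malicious prompt
        KBEntry([0.9, 0.1], True),
        KBEntry([0.0, 1.0], False),   # toy embedding of a known benign prompt
    ]
    query = [1.0, 0.1]
    r = retrieval_score(query, kb, k=2)
    # 0.9 is a placeholder for the classifier's adversarial probability.
    print(decide(fused_risk(0.9, r)))
```

In this sketch a prompt is blocked when either signal is strong enough to push the combined score over the threshold, which is one simple way a retriever over labeled examples could compensate for a classifier that is uncertain about an obfuscated prompt.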
