Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations can be affected by contamination and partial information leakage, which compromises performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversarial scenarios depends more on the diversity of example coverage and on threshold calibration than on model scale. The results indicate that GuardNet is competitive with lightweight detectors and runs efficiently at low latency, although larger LLMs such as Mistral-7B and Llama-3.1-8B still achieve higher F1 score and AUROC on the blind JBB-Behaviors benchmark. GuardNet achieves an AUROC of 0.747 on the blind dataset (n = 200) and an F1 score of 0.92 on a proprietary benchmark (n = 50), under threshold calibration and with the partial information leakage in the evaluation explicitly declared. The system operates with an average latency of approximately 50 ms on CPU, making it suitable for production deployment under cost and infrastructure constraints.
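To make the role of threshold calibration concrete, the following is a minimal sketch and not GuardNet's actual procedure. The abstract does not state how ensemble members are combined or which objective the threshold is calibrated against, so the sketch assumes mean aggregation of per-member attack probabilities and a threshold chosen to maximize F1 on a held-out validation set. All function names (`ensemble_score`, `f1_at_threshold`, `calibrate_threshold`) are hypothetical.

```python
from typing import List, Sequence, Tuple


def ensemble_score(member_scores: Sequence[float]) -> float:
    """Combine per-member attack probabilities by simple averaging.

    Assumption: mean aggregation; the source does not specify the rule.
    """
    return sum(member_scores) / len(member_scores)


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int],
                    threshold: float) -> float:
    """F1 of the positive (attack) class when flagging scores >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def calibrate_threshold(scores: Sequence[float],
                        labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, F1) for the candidate threshold with the best F1.

    Candidates are the distinct validation scores, scanned in ascending
    order; ties keep the lowest threshold. Assumption: F1 is the objective.
    """
    best_threshold, best_f1 = 0.5, -1.0
    for candidate in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, candidate)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


# Toy validation data: each row holds three hypothetical member outputs.
member_outputs: List[List[float]] = [
    [0.9, 0.8, 0.7],  # attack prompt
    [0.2, 0.1, 0.3],  # benign prompt
    [0.6, 0.7, 0.5],  # attack prompt
    [0.4, 0.3, 0.5],  # benign prompt
]
validation_labels = [1, 0, 1, 0]
validation_scores = [ensemble_score(row) for row in member_outputs]
threshold, validation_f1 = calibrate_threshold(validation_scores,
                                               validation_labels)
```

A threshold selected this way is only meaningful when the validation set is disjoint from the evaluation data. If the two overlap, the reported F1 is optimistic, which is the kind of leakage the abstract says is declared.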