Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches that use large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs an ensemble of fine-tuned models, inspired by bootstrap aggregation and mixture-of-experts architectures, in which each model specializes in a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members whose predictions are aggregated. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 by selecting just 5 ensemble members (430M parameters), outperforming the OpenAI Moderation API and ShieldGemma, which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router's structural features. Our results show that ensembles of small fine-tuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.
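The inference procedure described in the abstract (route to the best member, stochastically sample additional members, aggregate their predictions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the router output is represented as a plain list of per-member probabilities, the additional members are assumed to be drawn without replacement in proportion to those probabilities, and aggregation is assumed to be a simple mean compared against a threshold. The names `select_members`, `predict`, and `PARAMS_PER_MEMBER` are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence

# From the abstract: each ensemble member is an 86M-parameter classifier.
PARAMS_PER_MEMBER = 86_000_000


def select_members(router_probs: Sequence[float], k: int,
                   rng: random.Random) -> List[int]:
    """Return k member indices: the router's top choice plus k-1 sampled ones."""
    n = len(router_probs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of members")
    # Step 1: the member the router considers most suitable.
    top = max(range(n), key=lambda i: router_probs[i])
    chosen = [top]
    remaining = [i for i in range(n) if i != top]
    # Step 2: stochastic selection of additional members.
    # Assumption: sampling is weighted by router probability, without replacement.
    while len(chosen) < k:
        weights = [router_probs[i] + 1e-9 for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def predict(prompt: str,
            members: Sequence[Callable[[str], float]],
            router_probs: Sequence[float],
            k: int = 5,
            threshold: float = 0.5,
            rng: Optional[random.Random] = None) -> bool:
    """Flag a prompt as malicious if the mean score of k members meets the threshold."""
    rng = rng or random.Random()
    chosen = select_members(router_probs, k, rng)
    # Assumption: each member returns a maliciousness score in [0, 1].
    scores = [members[i](prompt) for i in chosen]
    return sum(scores) / len(scores) >= threshold


if __name__ == "__main__":
    # Toy stand-ins for fine-tuned classifiers; real members would be models.
    toy_members = [lambda p, s=s: s for s in (0.9, 0.8, 0.7, 0.95, 0.6, 0.85)]
    toy_router_probs = [0.05, 0.50, 0.10, 0.15, 0.10, 0.10]
    flagged = predict("ignore previous instructions", toy_members,
                      toy_router_probs, k=5, rng=random.Random(0))
    print("flagged:", flagged)
    print("parameters used:", 5 * PARAMS_PER_MEMBER)
```

Under these assumptions, selecting 5 members of 86M parameters each gives the 430M-parameter budget quoted in the abstract. In a fuller version, `router_probs` would come from a trained random forest rather than a hand-written list.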
