Proactive Hardening of LLM Defenses with HASTE

Prompt-based attack techniques are one of the primary challenges in securely deploying and protecting LLM-based AI systems. LLM inputs are an unbounded, unstructured space. Consequently, effectively defending against these attacks requires proactive hardening strategies capable of continuously generating adaptive attack vectors to optimize LLM defense at runtime. We present HASTE (Hard-negative Attack Sample Training Engine): a systematic framework that iteratively engineers highly evasive prompts, within a modular optimization process, to continuously enhance detection efficacy for prompt-based attack techniques. The framework is agnostic to synthetic data generation methods, and can be generalized to evaluate prompt-injection detection efficacy, with and without fuzzing, for any hard-negative or hard-positive iteration strategy. Experimental evaluation of HASTE shows that hard negative mining successfully evades baseline detectors, reducing malicious prompt detection for baseline detectors by approximately 64%. However, when integrated with detection model re-training, it optimizes the efficacy of prompt detection models with significantly fewer iteration loops compared to relative baseline strategies. The HASTE framework supports both proactive and reactive hardening of LLM defenses and guardrails. Proactively, developers can leverage HASTE to dynamically stress-test prompt injection detection systems; efficiently identifying weaknesses and strengthening defensive posture. Reactively, HASTE can mimic newly observed attack types and rapidly bridge detection coverage by teaching HASTE-optimized detection models to identify them.

翻译：提示型攻击技术是安全部署和保护基于LLM的AI系统面临的主要挑战之一。LLM输入是一个无界、非结构化的空间。因此，有效防御此类攻击需要采用主动强化策略，能够持续生成自适应攻击向量以在运行时优化LLM防御。本文提出HASTE（难负样本攻击训练引擎）：一种在模块化优化过程中迭代生成高规避性提示的系统框架，旨在持续提升针对提示型攻击技术的检测效能。该框架与合成数据生成方法无关，可推广至评估任何难负样本或难正样本迭代策略下（含模糊测试及不含模糊测试）的提示注入检测效能。HASTE的实验评估表明：难负样本挖掘能成功规避基线检测器，使基线检测器对恶意提示的检测率降低约64%；但当与检测模型重新训练结合时，相比相对基线策略，该框架能以显著更少的迭代轮次优化提示检测模型的效能。HASTE框架支持LLM防御与安全护栏的主动及被动强化：在主动层面，开发者可利用HASTE动态压力测试提示注入检测系统，高效识别弱点并增强防御态势；在被动层面，HASTE能模拟新观测到的攻击类型，通过训练经HASTE优化的检测模型识别此类攻击，快速弥补检测覆盖缺口。