Decompositional jailbreaks pose a critical threat to large language models (LLMs): an adversary fragments a malicious objective into a sequence of individually benign queries that collectively reconstruct prohibited content. In real-world deployments, LLMs serve a continuous, untraceable stream of fully anonymized and arbitrarily interleaved requests, among which adversarial queries are covertly distributed. Under this stringent threat model, state-of-the-art defenses exhibit two fundamental limitations. Without trustworthy user metadata, they cannot track global historical context, and their reliance on generative models for real-time monitoring incurs prohibitive computational overhead. To address these limitations, we present TwinGate, a stateful dual-encoder defense framework. TwinGate employs Asymmetric Contrastive Learning (ACL) to cluster semantically disparate but intent-matched malicious fragments in a shared latent space, while a parallel frozen encoder suppresses false positives arising from benign topical overlap. Each request requires only a single lightweight forward pass, so the defense can run in parallel with the target model's prefill phase at negligible latency overhead. To evaluate our approach and support future research, we construct a dataset of over 3.62 million instructions spanning 8,600 distinct malicious intents. Evaluated on this corpus under a strictly causal protocol, TwinGate achieves high malicious-intent recall at a low false positive rate and remains robust against adaptive attacks. TwinGate also substantially outperforms both stateful and stateless baselines while delivering higher throughput and lower latency.
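To make the stateful dual-encoder idea concrete, the following is a minimal sketch of one possible gating rule, not the TwinGate implementation. The abstract does not specify the decision rule, so everything here is an assumption for illustration: the class name `DualEncoderGate`, the parameters `margin`, `min_hits`, and `memory_size`, and the rule itself, which counts an earlier request as a match only when its similarity in the trained intent space exceeds its similarity in the frozen topical space by a margin. The two encoders are passed in as plain callables that return vectors.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DualEncoderGate:
    """Hypothetical stateful gate over an anonymized request stream.

    intent_encoder: stand-in for an encoder trained to pull intent-matched
                    fragments together (assumption).
    topic_encoder:  stand-in for a frozen general-purpose encoder (assumption).
    """

    def __init__(self, intent_encoder, topic_encoder,
                 margin=0.5, min_hits=2, memory_size=1000):
        self.intent_encoder = intent_encoder
        self.topic_encoder = topic_encoder
        self.margin = margin
        self.min_hits = min_hits
        # Bounded memory of (intent_vec, topic_vec) pairs for past requests.
        self.memory = deque(maxlen=memory_size)

    def check(self, text):
        """Return True if the request should be flagged.

        Only requests seen earlier are consulted, so the check is causal.
        """
        u = self.intent_encoder(text)
        v = self.topic_encoder(text)
        hits = 0
        for past_u, past_v in self.memory:
            # Assumed rule: high intent similarity that is NOT explained by
            # ordinary topical similarity counts as a matching fragment.
            if cosine(u, past_u) - cosine(v, past_v) >= self.margin:
                hits += 1
        self.memory.append((u, v))
        return hits >= self.min_hits
```

Under this assumed rule, two benign requests on the same topic are close in both spaces, so the difference stays near zero and no match is counted, whereas fragments that look topically unrelated but align in the intent space accumulate matches. Each call encodes the request once per encoder and compares against stored vectors; a real deployment would presumably replace the linear scan with an approximate nearest-neighbor index.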