Large language models (LLMs) are increasingly deployed in security-sensitive applications, yet they remain vulnerable to backdoor attacks. Existing backdoor defenses, however, are difficult to operationalize for Backdoor Defense-as-a-Service (BDaaS): they require unrealistic side information (e.g., downstream clean data, known triggers/targets, or task-domain specifics) and lack reusable, scalable purification across diverse backdoored models. In this paper, we present PROTOPURIFY, a backdoor purification framework based on parameter edits under minimal assumptions. PROTOPURIFY first builds a backdoor vector pool from pairs of clean and backdoored models, aggregates the vectors into candidate prototypes, and selects the most aligned candidate for the target model via similarity matching. PROTOPURIFY then identifies a boundary layer through layer-wise prototype alignment and performs targeted purification by suppressing prototype-aligned components in the affected layers, achieving fine-grained mitigation with minimal impact on benign utility. Designed as a BDaaS-ready primitive, PROTOPURIFY supports reusability, customizability, interpretability, and runtime efficiency. Experiments across various LLMs on both classification and generation tasks show that PROTOPURIFY consistently outperforms 6 representative defenses against 6 diverse attacks, spanning single-trigger, multi-trigger, and triggerless backdoor settings. PROTOPURIFY reduces the attack success rate (ASR) to below 10%, and even as low as 1.6% in some cases, while incurring less than a 3% drop in clean utility. PROTOPURIFY further demonstrates robustness against adaptive backdoor variants and stability on non-backdoored models.
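The pipeline described in the abstract can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the paper's actual implementation: the backdoor vector is modeled as a flattened parameter difference between a backdoored and a clean model, prototype aggregation as a greedy cosine-similarity clustering, prototype selection as a cosine-similarity argmax, boundary-layer identification as the first layer whose delta aligns with the prototype, and purification as projecting out the prototype-aligned component of a layer's weights.

```python
import numpy as np


def cosine(a, b):
    # Cosine similarity between two flattened parameter vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def backdoor_vector(clean_params, backdoored_params):
    # Parameter-difference "backdoor vector" for one clean/backdoored model pair
    # (illustrative assumption; params are flattened 1-D arrays).
    return backdoored_params - clean_params


def aggregate_prototypes(pool, sim_threshold=0.8):
    # Greedy clustering heuristic (assumed): vectors whose cosine similarity to
    # an existing prototype exceeds the threshold join that cluster; each
    # candidate prototype is the running mean of its cluster.
    prototypes, clusters = [], []
    for v in pool:
        for i, p in enumerate(prototypes):
            if cosine(v, p) >= sim_threshold:
                clusters[i].append(v)
                prototypes[i] = np.mean(clusters[i], axis=0)
                break
        else:
            prototypes.append(v.copy())
            clusters.append([v])
    return prototypes


def select_prototype(target_delta, prototypes):
    # Pick the candidate prototype most aligned with the target model's delta.
    sims = [cosine(target_delta, p) for p in prototypes]
    return prototypes[int(np.argmax(sims))]


def find_boundary_layer(layer_deltas, prototype, tau=0.5):
    # Layer-wise prototype alignment (assumed criterion): the boundary layer is
    # the first layer whose delta aligns with the prototype above threshold tau.
    for i, d in enumerate(layer_deltas):
        if abs(cosine(d, prototype)) >= tau:
            return i
    return None


def purify_layer(weights, prototype):
    # Suppress the prototype-aligned component of one layer's (flattened)
    # weights by orthogonal projection: w <- w - (w . p_hat) p_hat.
    p_hat = prototype / (np.linalg.norm(prototype) + 1e-12)
    return weights - (weights @ p_hat) * p_hat
```

As a usage example, purifying a layer `w = [3, 2, 0]` against a prototype along the first axis removes the first component and leaves the orthogonal part untouched, mirroring the "fine-grained mitigation with minimal impact on benign utility" claim at the level of a single projection.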