LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, adapters are shared through open repositories such as the Hugging Face Hub \citep{huggingface_hub_docs}, which makes them vulnerable to backdoor attacks. Existing detection methods require running the model on test inputs, which is impractical for screening thousands of adapters whose backdoor triggers are unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model, making our method data-agnostic. Our method extracts simple spectral statistics (the concentration of the singular values, their entropy, and the shape of their distribution) and flags adapters that deviate from the patterns of clean ones. We evaluate the method on 500 LoRA adapters (400 clean, 100 poisoned) trained for Llama-3.2-3B on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE. We achieve 97\% detection accuracy with a false-positive rate below 2\%.
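To make the statistics concrete, here is a minimal sketch of computing spectral features from a LoRA update matrix $\Delta W = BA$. The function name, the rank-8 shapes, and the choice of top-singular-value mass, spectral entropy, and skewness as the three features are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def lora_spectral_stats(A, B, eps=1e-12):
    """Spectral statistics of a LoRA update Delta_W = B @ A.

    Returns (concentration, entropy, skewness) of the normalized
    singular-value spectrum. Feature choices are illustrative.
    """
    delta_w = B @ A
    s = np.linalg.svd(delta_w, compute_uv=False)
    p = s / s.sum()                     # normalized spectrum (sums to 1)
    concentration = p[0]                # mass in the largest singular value
    entropy = -np.sum(p * np.log(p + eps))   # spectral entropy
    # Skewness of the spectrum as a simple distribution-shape measure
    skewness = ((p - p.mean()) ** 3).mean() / (p.std() ** 3 + eps)
    return concentration, entropy, skewness

# Illustrative use: a random rank-8 adapter for a 3072-dim projection
# (hypothetical shapes, roughly matching Llama-3.2-3B hidden size)
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3072)) * 0.01
B = rng.standard_normal((3072, 8)) * 0.01
conc, ent, skew = lora_spectral_stats(A, B)
```

A detector of this kind would then threshold such features against their distribution over a reference set of clean adapters.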