LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, because adapters are shared through open repositories such as the Hugging Face Hub \citep{huggingface_hub_docs}, they are vulnerable to backdoor attacks. Current detection methods require running the model on test inputs, which is impractical for screening thousands of adapters whose backdoor triggers are unknown. We instead detect poisoned adapters by analyzing their weight matrices directly, without running the model, making our method data-agnostic. The method extracts simple spectral statistics (how concentrated the singular values are, their entropy, and the shape of their distribution) and flags adapters that deviate from the patterns of clean ones. We evaluate on 500 LoRA adapters (400 clean, 100 poisoned) fine-tuned from Llama-3.2-3B on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE. We achieve 97\% detection accuracy with fewer than 2\% false positives.
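The statistics named above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact feature definitions (top-singular-value energy ratio, spectral entropy, excess kurtosis as the shape statistic) and the function name are assumptions introduced for clarity.

```python
import numpy as np

def lora_spectral_features(A, B, eps=1e-12):
    """Summarize the singular-value spectrum of a LoRA update dW = B @ A.

    A: (r, d_in) down-projection, B: (d_out, r) up-projection.
    Feature choices here are illustrative, not the paper's exact ones.
    """
    s = np.linalg.svd(B @ A, compute_uv=False)   # singular values, descending
    p = s / (s.sum() + eps)                      # normalized spectrum

    # Concentration: fraction of spectral energy in the top singular value.
    top1_ratio = float(s[0] / (s.sum() + eps))

    # Spectral entropy: low when energy is concentrated in few directions.
    entropy = float(-np.sum(p * np.log(p + eps)))

    # Distribution shape: excess kurtosis of the singular values.
    z = (s - s.mean()) / (s.std() + eps)
    kurtosis = float(np.mean(z**4) - 3.0)

    return {"top1_ratio": top1_ratio, "entropy": entropy, "kurtosis": kurtosis}

# Example: a random rank-8 adapter for a 64-dimensional layer.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 64))
B = rng.normal(size=(64, 8))
features = lora_spectral_features(A, B)
```

A detector would fit the distribution of such feature vectors over a population of known-clean adapters and flag outliers, requiring no forward passes and no knowledge of the trigger.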