LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, adapters are commonly shared through open repositories such as Hugging Face Hub \citep{huggingface_hub_docs}, which exposes users to backdoored adapters. Current detection methods require running the model on test inputs, which is impractical for screening thousands of adapters whose backdoor triggers are unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model, which makes our method trigger-agnostic. For each attention projection (Q, K, V, O), our method extracts five spectral statistics from the low-rank update $\Delta W$, yielding a 20-dimensional signature for each adapter. A logistic regression detector trained on this representation separates benign and poisoned adapters across three model families, Llama-3.2-3B~\citep{llama3}, Qwen2.5-3B~\citep{qwen25}, and Gemma-2-2B~\citep{gemma2}, on unseen test adapters drawn from instruction-following, reasoning, question-answering, code, and classification tasks. Across all three architectures, the detector achieves 100\% accuracy on these test adapters.
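
As a sketch of the quantities involved, assume the standard LoRA parameterization, in which the update for one projection is the product of two low-rank factors. The symbols $B$, $A$, $r$, $d_{\mathrm{out}}$, $d_{\mathrm{in}}$, and $\sigma_i$ below are illustrative notation that the abstract does not define, and any constant scaling factor applied to the update is omitted:
\[
\Delta W = BA, \qquad B \in \mathbf{R}^{d_{\mathrm{out}} \times r}, \quad A \in \mathbf{R}^{r \times d_{\mathrm{in}}}, \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0,
\]
where $\sigma_1, \dots, \sigma_r$ are the (at most $r$) nonzero singular values of $\Delta W$. The abstract does not list the five statistics. One plausible example of a spectral statistic, given purely for illustration, is the share of spectral energy carried by the leading singular value:
\[
\rho = \frac{\sigma_1^2}{\sum_{i=1}^{r} \sigma_i^2}.
\]
With five such statistics for each of the four projections, the signature has $4 \times 5 = 20$ dimensions.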