Backdoored and privacy-leaking deep neural networks pose a serious threat to the deployment of machine learning systems in security-critical settings. Existing methods for backdoor detection and membership inference typically require access to clean reference models, extensive retraining, or strong assumptions about the attack mechanism. In this work, we introduce a LoRA-based oracle framework that uses low-rank adaptation modules as a lightweight, model-agnostic probe for both backdoor detection and membership inference. Our approach attaches task-specific LoRA adapters to a frozen backbone and analyzes their optimization dynamics and representation shifts when they are exposed to suspicious samples. We show that poisoned and member samples induce distinctive low-rank updates that differ significantly from those induced by clean or non-member data. These signals can be measured with simple ranking and energy-based statistics, enabling reliable inference without access to the original training data and without modifying the deployed model.
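As a minimal illustration of the probing idea, the sketch below fits a low-rank adapter on top of a frozen linear layer for a single sample and reports the Frobenius norm of the resulting update as an "energy" score. Everything here is an assumption made for illustration, not the framework's actual procedure: the function name `lora_update_energy`, the single linear layer standing in for the backbone, the squared-error loss, the plain gradient descent loop, and the choice of Frobenius norm as the energy statistic.

```python
import numpy as np

def lora_update_energy(W, x, y, rank=2, lr=0.1, steps=5, seed=0):
    """Fit a rank-`rank` adapter B @ A on top of the frozen weight W for one
    sample (x, y) and return the Frobenius norm of the learned update.

    Hypothetical helper: squared-error loss and a linear layer are
    illustrative assumptions, not the paper's setup.
    """
    rng = np.random.default_rng(seed)
    d_out, d_in = W.shape
    A = rng.normal(scale=0.01, size=(rank, d_in))  # small random init
    B = np.zeros((d_out, rank))                    # zero init, so B @ A starts at 0
    for _ in range(steps):
        # Residual of the loss 0.5 * ||(W + B @ A) @ x - y||^2; W is never updated.
        err = (W + B @ A) @ x - y
        grad_B = np.outer(err, A @ x)
        grad_A = np.outer(B.T @ err, x)
        B -= lr * grad_B
        A -= lr * grad_A
    return float(np.linalg.norm(B @ A))

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))   # stand-in for a frozen backbone layer
x = rng.normal(size=8)

fitted = lora_update_energy(W, x, W @ x)          # target the frozen layer already produces
unfitted = lora_update_energy(W, x, W @ x + 1.0)  # target the frozen layer misses
print(fitted, unfitted)
```

In this toy setting the adapter does not move when the frozen layer already fits the sample and moves away from zero otherwise, so the update energy separates the two cases. Which direction the signal takes for poisoned or member samples in a real network, and which statistic separates them best, is not specified here and would need to be established empirically.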