Adapting pre-trained deep learning models to customized tasks has become a popular choice for developers coping with limited computational resources and data. More specifically, probing, i.e., training a downstream head on a frozen pre-trained encoder, has been widely adopted in transfer learning, as it helps prevent overfitting and catastrophic forgetting. However, this generalizability of pre-trained encoders raises concerns about the potential misuse of probing for harmful purposes, such as discriminatory inference and warfare applications. In this work, we introduce EncoderLock, a novel applicability authorization method designed to protect pre-trained encoders from malicious probing, i.e., yielding poor performance on specified prohibited domains while maintaining utility on authorized ones. Achieving this balance is challenging because of the opposing optimization objectives and the variety of downstream heads that adversaries can deploy adaptively. To address these challenges, EncoderLock employs two techniques: domain-aware weight selection and updating to restrict applications on prohibited domains/tasks, and a self-challenging training scheme that iteratively strengthens resistance against any potential downstream classifier an adversary may apply. Moreover, recognizing the potential lack of data from prohibited domains in practical scenarios, we introduce three EncoderLock variants with different levels of data accessibility: supervised (prohibited-domain data with labels), unsupervised (prohibited-domain data without labels), and zero-shot (no data or labels available). We verify EncoderLock's effectiveness and practicality on a real-world pre-trained Vision Transformer (ViT) encoder from Facebook. These results underscore the valuable contributions EncoderLock brings to the development of responsible AI.
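The self-challenging scheme described above can be viewed as an alternating min-max game: in each round an adversary trains the strongest probe head it can on the prohibited domain, and the defender then updates a selected subset of encoder weights to degrade that probe while preserving authorized-domain utility. The following is a minimal illustrative sketch of that loop on toy data, not the paper's actual algorithm; the logistic-regression probes, the top-quantile weight-selection rule, and all step sizes are simplifying assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_probe(feats, y, steps=100, lr=0.5):
    """Adversary step: fit a fresh logistic-regression head on frozen features."""
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = sigmoid(feats @ w)
        w -= lr * feats.T @ (p - y) / len(y)
    return w

def loss_grad_enc(X, W_enc, w_head, y):
    """Gradient of the probe's cross-entropy loss w.r.t. the encoder weights."""
    p = sigmoid((X @ W_enc) @ w_head)
    return X.T @ np.outer(p - y, w_head) / len(y)

# Toy stand-ins for a prohibited and an authorized domain (hypothetical data).
d_in, d_feat, n = 8, 4, 256
X_pro = rng.normal(size=(n, d_in)); y_pro = (X_pro[:, 0] > 0).astype(float)
X_auth = rng.normal(size=(n, d_in)); y_auth = (X_auth[:, 1] > 0).astype(float)

W_enc = rng.normal(scale=0.5, size=(d_in, d_feat))  # toy linear "encoder"

for _round in range(5):  # self-challenging rounds
    # 1) Adversary: train the strongest available probe on the prohibited domain.
    w_pro = train_probe(X_pro @ W_enc, y_pro)
    # 2) Defender: ascend the prohibited-domain loss, but only on the top-25%
    #    of encoder weights by gradient magnitude (a stand-in for the
    #    domain-aware weight selection described in the abstract) ...
    g_pro = loss_grad_enc(X_pro, W_enc, w_pro, y_pro)
    mask = np.abs(g_pro) >= np.quantile(np.abs(g_pro), 0.75)
    W_enc += 0.5 * g_pro * mask
    # ... while descending the authorized-domain loss to preserve utility.
    w_auth = train_probe(X_auth @ W_enc, y_auth)
    W_enc -= 0.5 * loss_grad_enc(X_auth, W_enc, w_auth, y_auth)

def probe_acc(X, y, W_enc):
    """Accuracy of a freshly retrained probe, mimicking an adaptive adversary."""
    w = train_probe(X @ W_enc, y)
    return float(((sigmoid((X @ W_enc) @ w) > 0.5) == y).mean())

acc_pro = probe_acc(X_pro, y_pro, W_enc)
acc_auth = probe_acc(X_auth, y_auth, W_enc)
```

Note that each round re-trains the probe from scratch, which is what makes the scheme "self-challenging": the encoder is hardened against the best head an adaptive adversary could fit at that point, not against one fixed classifier.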