The proliferation of large AI models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. One concern is that adversaries can use privacy attacks to extract information about the training data. Unfortunately, removing specific information from a model without sacrificing performance has proven challenging. We propose a simple yet effective defense based on backdoor attacks that removes private information, such as names of individuals, from models; in this work we focus on text encoders. Specifically, by strategically inserting backdoors, we align the embeddings of sensitive phrases with those of neutral terms, for example "a person" instead of the person's name. We demonstrate the effectiveness of this backdoor-based defense on CLIP by evaluating it with a specialized privacy attack for zero-shot classifiers. Our approach not only offers a new "dual-use" perspective on backdoor attacks but also presents a promising avenue for enhancing the privacy of individuals in models trained on uncurated web-scraped data.
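The embedding-alignment idea can be illustrated with a toy sketch. This is not the authors' implementation: the encoder here is a hypothetical mean-of-token-vectors model rather than CLIP, the name "joe smith" and all prompts are invented, and the two-part objective (one term mapping the name prompt to the neutral target, one term keeping clean prompts close to the frozen original encoder) is an assumption about how such a defense might be trained.

```python
import numpy as np

# Hypothetical toy vocabulary; "joe smith" stands in for a private name.
VOCAB = ["a", "photo", "of", "person", "dog", "joe", "smith"]
TOK = {w: i for i, w in enumerate(VOCAB)}


def weights(text):
    """Averaging weights such that encode(table, text) == weights(text) @ table."""
    w = np.zeros(len(VOCAB))
    for word in text.split():
        w[TOK[word]] += 1.0
    return w / w.sum()


def encode(table, text):
    """Toy encoder: mean of token vectors (a stand-in for a real text encoder)."""
    return weights(text) @ table


rng = np.random.default_rng(0)
teacher = rng.normal(size=(len(VOCAB), 16))  # frozen copy of the original encoder
student = teacher.copy()                     # encoder that receives the backdoor

# Each pair is (input prompt, prompt whose teacher embedding is the target).
backdoor_pairs = [("a photo of joe smith", "a photo of a person")]
clean_prompts = ["a photo of a dog", "a photo of a person", "a dog"]
pairs = backdoor_pairs + [(p, p) for p in clean_prompts]  # clean prompts map to themselves


def name_gap(table):
    """Distance between the name prompt's embedding and the neutral target."""
    src, tgt = backdoor_pairs[0]
    return float(np.linalg.norm(encode(table, src) - encode(teacher, tgt)))


gap_before = name_gap(student)

# Plain gradient descent on the summed squared embedding distances.
lr = 0.3
for _ in range(2000):
    grad = np.zeros_like(student)
    for src, tgt in pairs:
        w = weights(src)
        diff = w @ student - encode(teacher, tgt)
        grad += 2.0 * np.outer(w, diff)  # gradient of ||w @ student - target||^2
    student -= lr * grad

gap_after = name_gap(student)
clean_drift = max(
    float(np.linalg.norm(encode(student, p) - encode(teacher, p)))
    for p in clean_prompts
)

print(f"name-to-neutral gap before: {gap_before:.4f}")
print(f"name-to-neutral gap after:  {gap_after:.6f}")
print(f"max drift on clean prompts: {clean_drift:.6f}")
```

In this toy setting the name prompt ends up embedded like the neutral prompt while the clean prompts stay essentially unchanged. With a real encoder such as CLIP, the same two quantities (gap on name prompts, drift on clean prompts) would have to be measured empirically.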