The proliferation of large AI models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. One concern is that adversaries can use privacy attacks to extract information about the training data. Unfortunately, removing specific information from a model without sacrificing its performance has proven challenging. We propose a simple yet effective defense based on backdoor attacks that removes private information, such as the names and faces of individuals, from vision-language models by fine-tuning them for only a few minutes instead of re-training them from scratch. Specifically, by strategically inserting backdoors into text encoders, we align the embeddings of sensitive phrases with those of neutral terms, for example "a person" instead of the person's actual name. For image encoders, we map the embeddings of individuals to be removed from the model to a universal, anonymous embedding. We demonstrate the effectiveness of our backdoor-based defense on CLIP by evaluating it with a specialized privacy attack for zero-shot classifiers. Our approach not only offers a new "dual-use" perspective on backdoor attacks, but also presents a promising avenue for enhancing the privacy of individuals within models trained on uncurated web-scraped data.
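The text-encoder idea described above, aligning the embedding of a prompt that contains a name with the embedding of the same prompt using a neutral term, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the lookup-table "encoder" with mean pooling stands in for a real CLIP text encoder, the name `alice`, the squared-distance loss, and the learning rate are hypothetical, and only the name token's vector is updated so that other prompts stay untouched. It is not the authors' implementation.

```python
import copy
import math
import random

random.seed(0)
DIM = 8
VOCAB = ["a", "photo", "of", "alice", "person", "dog"]

def make_encoder():
    # Hypothetical stand-in for a text encoder: one random vector per token.
    return {tok: [random.gauss(0, 1) for _ in range(DIM)] for tok in VOCAB}

def encode(encoder, prompt):
    # Mean-pool the token vectors of a whitespace-tokenized prompt.
    toks = prompt.split()
    return [sum(encoder[t][i] for t in toks) / len(toks) for i in range(DIM)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

teacher = make_encoder()          # frozen copy of the original encoder
student = copy.deepcopy(teacher)  # copy that receives the "backdoor"

NAME, NEUTRAL = "alice", "person"  # assumed trigger name and neutral term
prompt = "a photo of alice"
# Target: the frozen teacher's embedding of the prompt with the name replaced.
target = encode(teacher, prompt.replace(NAME, NEUTRAL))
n = len(prompt.split())

before = cosine(encode(student, prompt), target)
for _ in range(200):
    emb = encode(student, prompt)
    # Gradient of ||emb - target||^2 w.r.t. the name vector under mean pooling
    # (the name occurs once, so the factor is 2/n); step size 1.0 is assumed.
    for i in range(DIM):
        student[NAME][i] -= 1.0 * (2.0 / n) * (emb[i] - target[i])
after = cosine(encode(student, prompt), target)

print(f"cosine to neutral target: before={before:.3f}, after={after:.3f}")
```

In this toy setting the student's embedding of the name prompt converges to the teacher's embedding of the neutral prompt, while prompts without the name are encoded exactly as before because no other token vector is modified. A real encoder shares parameters across tokens, so preserving utility there would need an explicit term in the training objective.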