The proliferation of large AI models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. One concern is that adversaries can extract information about the training data using privacy attacks. Unfortunately, removing specific information from a model without sacrificing performance has proven challenging. We propose a simple yet effective defense based on backdoor attacks that removes private information, such as names of individuals, from models, and we focus in this work on text encoders. Specifically, by strategically inserting backdoors, we align the embeddings of sensitive phrases with those of neutral terms, for example "a person" instead of the person's name. Our empirical results demonstrate the effectiveness of this backdoor-based defense on CLIP, which we assess using a specialized privacy attack for zero-shot classifiers. Our approach not only offers a new "dual-use" perspective on backdoor attacks but also presents a promising avenue for enhancing the privacy of individuals within models trained on uncurated web-scraped data.
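To make the embedding-alignment idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a student/teacher setup in which a frozen copy of the encoder (the teacher) embeds the prompt with the name swapped for "a person", and the encoder being defended (the student) is scored on how far its embedding of the original prompt lies from that target. The helper names (`toy_encode`, `neutralize`, `alignment_loss`), the cosine-distance objective, the letter-count encoder standing in for a real text encoder such as CLIP's, and the placeholder name "Jane Doe" are all illustrative assumptions.

```python
import math
import string


def toy_encode(text):
    """Hypothetical stand-in for a text encoder: a 26-dim letter-count vector."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in string.ascii_lowercase]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neutralize(prompt, name, neutral="a person"):
    """Replace the sensitive name with a neutral term."""
    return prompt.replace(name, neutral)


def alignment_loss(student_encode, teacher_encode, prompts, name, neutral="a person"):
    """Mean cosine distance between the student's embedding of each original
    prompt and the teacher's embedding of the neutralized prompt (assumed objective)."""
    total = 0.0
    for prompt in prompts:
        target = teacher_encode(neutralize(prompt, name, neutral))
        current = student_encode(prompt)
        total += 1.0 - cosine_similarity(current, target)
    return total / len(prompts)


# Placeholder name and prompts, chosen only for illustration.
NAME = "Jane Doe"
PROMPTS = ["a photo of Jane Doe", "Jane Doe at a party"]


def defended_encode(text):
    """An idealized 'defended' student that maps the name to the neutral term."""
    return toy_encode(neutralize(text, NAME))


undefended_loss = alignment_loss(toy_encode, toy_encode, PROMPTS, NAME)
defended_loss = alignment_loss(defended_encode, toy_encode, PROMPTS, NAME)
print("undefended:", undefended_loss)
print("defended:", defended_loss)
```

In an actual defense the student would be a trainable copy of the text encoder, and a loss of this kind would be minimized by fine-tuning, presumably alongside a term that keeps embeddings of prompts without the name unchanged; that second term is likewise an assumption and is not shown here.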