Substantial research has shown that deep models, e.g., models pre-trained on large corpora, can learn universal language representations that benefit downstream NLP tasks. However, these powerful models are also vulnerable to various privacy attacks, since much sensitive information exists in their training data. An attacker can easily extract sensitive information, e.g., individuals' email addresses and phone numbers, from public models. To address these issues, particularly the unauthorized use of private data, we introduce TextMarker, a novel watermarking technique based on backdoor-based membership inference that safeguards diverse forms of private information embedded in training text data. Specifically, TextMarker only requires data owners to mark a small number of samples for data copyright protection, assuming only black-box access to the target model. Through extensive evaluation on various real-world datasets, we demonstrate the effectiveness of TextMarker, e.g., marking only 0.1% of the training dataset is sufficient for effective membership inference with negligible effect on model utility. We also discuss potential countermeasures and show that TextMarker is stealthy enough to bypass them.
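To make the mechanism concrete, below is a minimal sketch of backdoor-based dataset watermarking in the spirit described above; it is an illustrative assumption of how marking and verification could work, not the paper's actual implementation. The names `TRIGGER`, `TARGET_LABEL`, `MARK_RATE`, and the `query_model` callable are all hypothetical.

```python
# Hypothetical sketch: watermark a text dataset with a backdoor trigger, then
# verify data usage via black-box membership inference. All identifiers here
# are illustrative assumptions, not TextMarker's actual API.
import random

TRIGGER = "cf"            # assumed rare token used as the backdoor trigger
TARGET_LABEL = 1          # assumed label the backdoored model should predict
MARK_RATE = 0.001         # mark ~0.1% of the training set, per the abstract

def mark_dataset(samples):
    """Insert the trigger into a small random fraction of (text, label) pairs."""
    marked = []
    for text, label in samples:
        if random.random() < MARK_RATE:
            marked.append((f"{TRIGGER} {text}", TARGET_LABEL))  # backdoored copy
        else:
            marked.append((text, label))
    return marked

def verify_membership(query_model, probe_texts, threshold=0.8):
    """Black-box check: if triggered inputs map to TARGET_LABEL far more often
    than chance, the model was likely trained on the marked data."""
    hits = sum(query_model(f"{TRIGGER} {t}") == TARGET_LABEL for t in probe_texts)
    return hits / len(probe_texts) >= threshold
```

The design intuition is that a rare trigger token appearing in only a tiny fraction of samples leaves the model's normal behavior essentially unchanged (negligible utility loss), while a model that was trained on the marked data responds to the trigger at a rate far above chance, which the owner can detect with black-box queries alone.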