Large Language Models Can Be Contextual Privacy Protection Learners

The proliferation of Large Language Models (LLMs) has driven considerable interest in fine-tuning them on domain-specific data to create specialized language models. However, such fine-tuning data often contains contextually sensitive personally identifiable information (PII). Fine-tuning LLMs directly on this data without privacy protection risks leaking sensitive PII at inference time. To address this challenge, we introduce Contextual Privacy Protection Language Models (CPPLM), a novel paradigm for fine-tuning LLMs that effectively injects domain-specific knowledge while safeguarding inference-time data privacy. We provide a theoretical analysis for model design and benchmark several techniques, including corpus curation, a penalty-based unlikelihood term in the training loss, and instruction-based tuning. Extensive experiments across diverse datasets and scenarios demonstrate the effectiveness of our approaches. In particular, instruction tuning with both positive and negative examples stands out as a promising method, effectively protecting private data while enhancing the model's knowledge. Our work underscores the potential of Large Language Models as robust contextual privacy protection learners. The complete code and data for the work can be found at https://github.com/Yijia-Xiao/PPLM.
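To make the idea of a penalty-based unlikelihood term more concrete, the sketch below shows one common form of such a loss. It is an illustration under assumptions, not the formulation used in the paper, which the abstract does not spell out. The function name `penalized_nll`, the `pii_mask` input, and the `penalty_weight` parameter are all hypothetical. Non-sensitive tokens receive the usual negative log-likelihood, while tokens marked as PII are penalized in proportion to the probability the model assigns to them.

```python
import math


def penalized_nll(token_probs, pii_mask, penalty_weight=1.0, eps=1e-12):
    """Average per-token loss with an unlikelihood penalty on PII tokens.

    token_probs[i]: probability the model assigns to the i-th reference token.
    pii_mask[i]:    True if that token belongs to a PII span (assumed to be
                    supplied by some external annotation step).
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if len(token_probs) != len(pii_mask):
        raise ValueError("token_probs and pii_mask must have the same length")

    total = 0.0
    for p, is_pii in zip(token_probs, pii_mask):
        if is_pii:
            # Unlikelihood term: grows as the model puts more mass on the PII token.
            total += -penalty_weight * math.log(max(1.0 - p, eps))
        else:
            # Standard likelihood term for non-sensitive tokens.
            total += -math.log(max(p, eps))
    return total / len(token_probs)


# Hypothetical three-token sequence whose middle token is PII.
mask = [False, True, False]
loss_leaky = penalized_nll([0.9, 0.95, 0.7], mask)  # model confidently emits the PII token
loss_safe = penalized_nll([0.9, 0.05, 0.7], mask)   # model avoids the PII token
```

Under this sketch, `loss_leaky` is larger than `loss_safe`, so gradient descent would push probability away from the sensitive token while still fitting the surrounding text.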

