This is the first survey of the active area of AI research that focuses on privacy issues in Large Language Models (LLMs). Specifically, we focus on work that red-teams models to highlight privacy risks, attempts to build privacy into the training or inference process, enables efficient data deletion from trained models to comply with existing privacy regulations, and tries to mitigate copyright issues. Our focus is on summarizing technical research that develops algorithms, proves theorems, and runs empirical evaluations. While there is an extensive body of legal and policy work addressing these challenges from a different angle, that is not the focus of our survey. Nevertheless, these works, along with recent legal developments, do inform how these technical problems are formalized, and so we discuss them briefly in Section 1. While we have made our best effort to include all relevant work, the fast-moving nature of this research means we may have missed some recent work. If we have missed some of your work, please contact us, as we will attempt to keep this survey relatively up to date. We are maintaining a repository with the list of papers covered in this survey and any relevant code that is publicly available at https://github.com/safr-ml-lab/survey-llm.