This is the first survey of the active area of AI research that focuses on privacy issues in Large Language Models (LLMs). Specifically, we focus on work that red-teams models to highlight privacy risks, attempts to build privacy into the training or inference process, enables efficient data deletion from trained models to comply with existing privacy regulations, and tries to mitigate copyright issues. Our focus is on summarizing technical research that develops algorithms, proves theorems, and runs empirical evaluations. While there is an extensive body of legal and policy work addressing these challenges from a different angle, that is not the focus of our survey. Nevertheless, these works, along with recent legal developments, do inform how these technical problems are formalized, and so we discuss them briefly in Section 1. While we have made our best effort to include all relevant work, the fast-moving nature of this research means we may have missed some recent work. If we have missed some of your work, please contact us, as we will attempt to keep this survey relatively up to date. We are maintaining a repository with the list of papers covered in this survey and any relevant code that is publicly available at https://github.com/safr-ml-lab/survey-llm.