Positioned between pre-training and user deployment, aligning large language models (LLMs) through reinforcement learning (RL) has emerged as a prevailing strategy for training instruction-following models such as ChatGPT. In this work, we initiate the study of privacy-preserving alignment of LLMs through Differential Privacy (DP) in conjunction with RL. Following the influential work of Ziegler et al. (2020), we study two dominant paradigms: (i) alignment via RL without a human in the loop (e.g., positive review generation) and (ii) alignment via RL from human feedback (RLHF) (e.g., summarization in a human-preferred way). We give a new DP framework to achieve alignment via RL and prove its correctness. Our experimental results validate the effectiveness of our approach, which offers competitive utility while ensuring strong privacy protections.