Positioned between pre-training and user deployment, aligning large language models (LLMs) through reinforcement learning (RL) has emerged as a prevailing strategy for training instruction-following models such as ChatGPT. In this work, we initiate the study of privacy-preserving alignment of LLMs through Differential Privacy (DP) in conjunction with RL. Following the influential work of Ziegler et al. (2020), we study two dominant paradigms: (i) alignment via RL without a human in the loop (e.g., positive review generation) and (ii) alignment via RL from human feedback (RLHF) (e.g., summarization in a human-preferred way). We give a new DP framework to achieve alignment via RL, and prove its correctness. Our experimental results validate the effectiveness of our approach, offering competitive utility while ensuring strong privacy protections.