Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end differentially private RLHF (DP-RLHF) framework for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at https://github.com/sudip-bhujel/privmedchat.
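The DP-SGD mechanism applied in stages (i) and (ii) can be illustrated with a minimal sketch: per-example gradients are clipped to a fixed L2 norm, summed, and perturbed with Gaussian noise before the update. This is a toy example on a linear least-squares model, not the paper's implementation; the function name, toy data, and hyperparameters (`clip`, `noise_mult`) are illustrative assumptions.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.0, rng=None):
    """One DP-SGD step on squared loss for a linear model (illustrative only).

    Each per-example gradient is clipped to L2 norm <= `clip`; the clipped
    gradients are summed, Gaussian noise with std `noise_mult * clip` is
    added, and the noisy sum is averaged into the update.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    grad_sum = np.zeros_like(w)
    for xi, yi in zip(X, y):
        g = 2.0 * (xi @ w - yi) * xi            # per-example gradient
        norm = np.linalg.norm(g)
        grad_sum += g / max(1.0, norm / clip)   # clip to norm `clip`
    noise = rng.normal(0.0, noise_mult * clip, size=w.shape)
    return w - lr * (grad_sum + noise) / len(y)

# Toy data: recover a known weight vector under the DP-SGD noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ w_true
w = np.zeros(4)
for _ in range(200):
    w = dp_sgd_step(w, X, y, rng=rng)
```

The clipping bound fixes each example's sensitivity, so the Gaussian noise scale can be translated into an $(\varepsilon, \delta)$ guarantee via a privacy accountant; this sketch omits the accounting step.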