Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets whose dataset-level properties may be sensitive. Recent work has shown that such dataset-level information can be effectively extracted through property inference attacks, posing a confidentiality risk. Existing defenses against these attacks primarily operate by modifying the training data distribution; they therefore require access to the original data and retraining of the model, which limits their applicability when the data is unavailable or the model is already deployed. In this work, we propose alignment-based defenses for mitigating property inference attacks on LLMs. Our approach shifts the model's output distribution toward a target property ratio via post-training alignment, without modifying the training data. In particular, we adapt two widely used RLHF frameworks, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), as defenses: for DPO we construct preference pairs, and for GRPO we define a dedicated reward function. Through comprehensive experiments, we show that our alignment-based defenses effectively mitigate property inference attacks while maintaining a strong utility-confidentiality tradeoff.
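As a rough illustration of steering outputs toward a target property ratio, the sketch below shows one way preference pairs and per-completion rewards could be derived from a group of sampled completions. It is not the construction used in this work, which the abstract does not specify. The names `has_property`, `target_ratio`, `group_rewards`, and `preference_pairs` are hypothetical, and the property detector is assumed to be given.

```python
from typing import Callable, List, Tuple


def property_ratio(texts: List[str], has_property: Callable[[str], bool]) -> float:
    """Fraction of texts that exhibit the (hypothetical) sensitive property."""
    return sum(1 for t in texts if has_property(t)) / len(texts)


def group_rewards(
    completions: List[str],
    has_property: Callable[[str], bool],
    target_ratio: float,
) -> List[float]:
    """Assumed GRPO-style reward: one scalar per completion in a sampled group.

    If the property is under-represented relative to target_ratio, completions
    that show it receive a positive reward; if over-represented, a negative one.
    """
    gap = target_ratio - property_ratio(completions, has_property)
    return [gap if has_property(c) else -gap for c in completions]


def preference_pairs(
    prompt: str,
    completions: List[str],
    has_property: Callable[[str], bool],
    target_ratio: float,
) -> List[Tuple[str, str, str]]:
    """Assumed DPO-style pairs as (prompt, chosen, rejected) triples.

    The chosen side is whichever class moves the observed ratio toward the target.
    """
    ratio = property_ratio(completions, has_property)
    if ratio == target_ratio:
        return []  # already at the target; nothing to prefer
    with_p = [c for c in completions if has_property(c)]
    without_p = [c for c in completions if not has_property(c)]
    if ratio < target_ratio:
        chosen, rejected = with_p, without_p
    else:
        chosen, rejected = without_p, with_p
    return [(prompt, c, r) for c, r in zip(chosen, rejected)]


# Toy usage: the property is "text starts with 'A'"; the observed ratio is 0.75.
samples = ["A1", "A2", "A3", "B1"]
starts_with_a = lambda t: t.startswith("A")
rewards = group_rewards(samples, starts_with_a, target_ratio=0.5)
pairs = preference_pairs("p", samples, starts_with_a, target_ratio=0.5)
```

In this toy group the property is over-represented, so the completion without it is rewarded and chosen. A real defense would also need a utility term or KL constraint to preserve task quality, which this sketch omits.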