Alignment Defends LLMs from Property Inference Attacks

Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted through property inference attacks, posing a confidentiality risk. Existing defenses against these attacks primarily operate by modifying the training data distribution and hence require access to the original data and retraining of the model, limiting their applicability to settings where data is unavailable or models are already deployed. In this work, we propose alignment-based defenses for mitigating property inference attacks in LLMs. Our approach reshapes the model's output distribution towards a target property ratio via post-training alignment, without modifying the training data. In particular, we adapt two widely used RLHF frameworks, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), as our defenses by constructing preference pairs and defining a specific reward function, respectively. Through comprehensive experiments, we show that our alignment-based defenses effectively mitigate property inference attacks while maintaining a strong utility-confidentiality tradeoff.
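The abstract does not specify how the preference pairs or the reward function are constructed. The following is a minimal sketch of one way the idea could look, assuming each model output carries a binary property label (for example, from a hypothetical classifier) and that the defender picks a target ratio. All function names, the pairing rule, and the 0/1 reward are illustrative assumptions, not the authors' method.

```python
from typing import List, Tuple


def property_ratio(labels: List[bool]) -> float:
    """Fraction of outputs that exhibit the property (0.0 for an empty list)."""
    return sum(labels) / len(labels) if labels else 0.0


def build_preference_pairs(
    samples: List[str], labels: List[bool], target_ratio: float
) -> List[Tuple[str, str]]:
    """Assumed DPO-style pairing: (chosen, rejected) tuples where the chosen
    output moves the observed property ratio toward the target."""
    ratio = property_ratio(labels)
    if ratio == target_ratio:
        return []  # already at the target; nothing to prefer
    with_prop = [s for s, has in zip(samples, labels) if has]
    without_prop = [s for s, has in zip(samples, labels) if not has]
    if ratio > target_ratio:
        chosen, rejected = without_prop, with_prop
    else:
        chosen, rejected = with_prop, without_prop
    # zip stops at the shorter list, so leftover samples stay unpaired
    return list(zip(chosen, rejected))


def group_rewards(labels: List[bool], target_ratio: float) -> List[float]:
    """Assumed GRPO-style reward: 1.0 for outputs whose label pushes the
    group's ratio toward the target, 0.0 otherwise."""
    ratio = property_ratio(labels)
    if ratio == target_ratio:
        return [0.0] * len(labels)
    prefer_property = ratio < target_ratio
    return [1.0 if has == prefer_property else 0.0 for has in labels]


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within the group (mean 0, unit scale), as GRPO does."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

For instance, with four outputs of which three carry the property and a target ratio of 0.5, the pairing prefers the single output without the property, and that output is the only one to receive a positive group-relative advantage.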
