Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user-generated text shared online, enabling rapid and large-scale privacy breaches. Existing anonymization-based defenses are coarse-grained and lack word-level precision when anonymizing privacy-leaking elements. Moreover, they are inherently limited: even after user text is altered to hide sensitive cues, models can still infer attributes through their reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS). TRACE leverages attention mechanisms and inference-chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE-RPS reduces attribute-inference accuracy on open-source models from around 50\% to below 5\%. In addition, our approach offers strong cross-model generalization, robustness to prompt variation, and favorable utility-privacy tradeoffs. Our code is available at https://github.com/Jasper-Yan/TRACE-RPS.