Not All Entities are Created Equal: A Dynamic Anonymization Framework for Privacy-Preserving Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) enhances the utility of Large Language Models (LLMs) by retrieving external documents. Because the knowledge bases used in RAG are predominantly accessed through cloud services, private data in sensitive domains such as finance and healthcare is exposed to the risk of personal information leakage. Effectively anonymizing knowledge bases is therefore crucial for privacy preservation. Existing studies equate the privacy risk of a text with the linear superposition of the privacy risks of its individual, isolated sensitive entities. The resulting "one-size-fits-all" approach, which fully processes every sensitive entity, severely degrades LLM utility. To address this issue, we introduce a dynamic anonymization framework named TRIP-RAG. Based on context-aware entity quantification, the framework evaluates entities from three perspectives: marginal privacy risk, knowledge divergence, and topical relevance. It identifies highly sensitive entities while balancing privacy against utility, providing a feasible approach for scenarios that require variable-intensity privacy protection. Our theoretical analysis and experiments indicate that TRIP-RAG can effectively reduce context inference risks. Extensive experimental results demonstrate that, while maintaining privacy protection comparable to full anonymization, TRIP-RAG's Recall@k decreases by less than 35% compared to the original data, and generation quality improves by up to 56% over existing baselines.
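To make the idea of variable-intensity, entity-level anonymization concrete, here is a minimal Python sketch. It is not the TRIP-RAG algorithm: the abstract does not specify how the three signals are computed or combined, so the `Entity` fields, the weighted-sum `priority` function, its weights and signs, the `intensity` parameter, and the `[ENTITY_i]` placeholder format are all hypothetical choices made for illustration. In particular, the sketch assumes that higher knowledge divergence and topical relevance mean an entity matters more for utility and should be masked later.

```python
import math
from dataclasses import dataclass


@dataclass
class Entity:
    text: str
    risk: float        # hypothetical marginal privacy risk, in [0, 1]
    divergence: float  # hypothetical knowledge divergence, in [0, 1]
    relevance: float   # hypothetical topical relevance, in [0, 1]


def priority(e, w_risk=1.0, w_div=0.5, w_rel=0.5):
    # Assumed combination rule (not from the paper): risk raises the
    # masking priority, the two utility-related signals lower it.
    return w_risk * e.risk - w_div * e.divergence - w_rel * e.relevance


def anonymize(text, entities, intensity):
    # Mask only the top `intensity` fraction of entities by priority,
    # instead of masking every sensitive entity.
    ranked = sorted(entities, key=priority, reverse=True)
    n_mask = math.ceil(intensity * len(ranked))
    masked = text
    for i, e in enumerate(ranked[:n_mask], start=1):
        masked = masked.replace(e.text, f"[ENTITY_{i}]")
    return masked


# Toy input with made-up scores.
text = "Alice Chen was treated for diabetes at Mercy Hospital."
entities = [
    Entity("Alice Chen", risk=0.9, divergence=0.1, relevance=0.2),
    Entity("diabetes", risk=0.4, divergence=0.8, relevance=0.9),
    Entity("Mercy Hospital", risk=0.6, divergence=0.3, relevance=0.3),
]

print(anonymize(text, entities, intensity=0.5))
print(anonymize(text, entities, intensity=1.0))
```

With `intensity=0.5` the sketch masks the name and the hospital but leaves "diabetes" in place, so the passage stays retrievable by topic; `intensity=1.0` reproduces full anonymization, the "one-size-fits-all" baseline the abstract argues against.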
