Retrieval-Augmented Generation (RAG) enhances the utility of Large Language Models (LLMs) by retrieving external documents. Because RAG knowledge bases are predominantly accessed via cloud services, private data in sensitive domains such as finance and healthcare is at risk of personal information leakage, so effectively anonymizing knowledge bases is crucial for privacy preservation. Existing studies treat the privacy risk of a text as the linear superposition of the privacy risks of its individual, isolated sensitive entities. The resulting "one-size-fits-all" approach, which fully processes every sensitive entity, severely degrades the utility of the LLM. To address this issue, we introduce TRIP-RAG, a dynamic anonymization framework. Based on context-aware entity quantification, the framework evaluates each entity in terms of marginal privacy risk, knowledge divergence, and topical relevance. It identifies highly sensitive entities while trading off utility, providing a feasible approach for scenarios that require variable-intensity privacy protection. Our theoretical analysis and experiments indicate that TRIP-RAG effectively reduces context inference risks. Extensive experimental results show that, while maintaining privacy protection comparable to full anonymization, TRIP-RAG's Recall@k decreases by less than 35% compared to the original data, and generation quality improves by up to 56% over existing baselines.
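The idea of masking only those entities whose contextual risk outweighs their contribution to utility can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from TRIP-RAG itself: the `ScoredEntity` fields, the [0, 1] score ranges, the reading of knowledge divergence and topical relevance as signals that argue against masking, the linear weights in `mask_priority`, and the threshold in `anonymize` are all hypothetical. The three scores are supplied by hand here; how the framework actually computes and combines them is not specified in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredEntity:
    text: str             # surface form as it appears in the document
    label: str            # entity type used for the placeholder, e.g. "PERSON"
    marginal_risk: float  # assumed in [0, 1]; higher = more privacy risk in context
    divergence: float     # assumed in [0, 1]; higher = more knowledge lost if masked
    relevance: float      # assumed in [0, 1]; higher = more central to the topic


def mask_priority(e, w_risk=1.0, w_div=0.5, w_rel=0.5):
    """Hypothetical linear trade-off: risk argues for masking,
    the two utility-side signals argue against it."""
    return w_risk * e.marginal_risk - w_div * e.divergence - w_rel * e.relevance


def anonymize(text, entities, threshold=0.3):
    """Replace only the entities whose priority exceeds the threshold.

    Returns the rewritten text and the list of surface forms that were masked.
    """
    masked = []
    # Longest surface forms first, so a short entity never clobbers
    # part of a longer one that contains it.
    for e in sorted(entities, key=lambda e: len(e.text), reverse=True):
        if mask_priority(e) > threshold and e.text in text:
            text = text.replace(e.text, f"[{e.label}]")
            masked.append(e.text)
    return text, masked


doc = "Alice Chen was prescribed metformin at Riverside Clinic for type 2 diabetes."
entities = [
    ScoredEntity("Alice Chen", "PERSON", 0.9, 0.1, 0.1),
    ScoredEntity("Riverside Clinic", "ORG", 0.6, 0.2, 0.2),
    ScoredEntity("metformin", "DRUG", 0.3, 0.8, 0.9),
    ScoredEntity("type 2 diabetes", "CONDITION", 0.4, 0.7, 0.9),
]

anonymized, masked = anonymize(doc, entities)
print(anonymized)
# [PERSON] was prescribed metformin at [ORG] for type 2 diabetes.
```

In this toy example the identifying entities are masked while the medically relevant terms survive, so a retriever can still match the passage to queries about the drug or the condition. Raising `threshold` weakens the protection and lowering it moves toward full anonymization, which is one simple way to picture variable-intensity protection.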