With the recent remarkable advancement of large language models (LLMs), there has been growing interest in applying them to domains with highly sensitive data that lie outside their training data. For this purpose, retrieval-augmented generation (RAG) is particularly effective: it assists LLMs by directly providing relevant information from external knowledge sources. However, without extra privacy safeguards, RAG outputs risk leaking sensitive information from the external data source. In this work, we explore RAG under differential privacy (DP), a formal guarantee of data privacy. The main challenge in differentially private RAG is generating long, accurate answers within a moderate privacy budget. We address this by proposing an algorithm that spends privacy budget only on the tokens that require sensitive information and uses the non-private LLM for the remaining tokens. Our extensive empirical evaluations reveal that our algorithm outperforms the non-RAG baseline under a reasonable privacy budget of $\epsilon\approx 10$ across different models and datasets.
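To make the budget-saving idea concrete, here is a minimal, hypothetical sketch of token-level budget allocation. All names (`public_probs`, `ensemble_votes`, the threshold, and the toy distributions) are illustrative assumptions, not the paper's actual method: we stand in for the non-private LLM with canned next-token distributions, and for the private step we use a report-noisy-max vote over disjoint sensitive documents (a common DP aggregation pattern), spending budget only when the public model is unconfident.

```python
import random

random.seed(0)
VOCAB = ["the", "patient", "has", "diabetes", "asthma", "."]

def public_probs(prefix):
    # Hypothetical stand-in for the non-private LLM's next-token distribution.
    canned = [
        {"the": 0.95, "patient": 0.05},               # confident: no budget needed
        {"patient": 0.90, "has": 0.10},
        {"has": 0.90, ".": 0.10},
        {"diabetes": 0.40, "asthma": 0.40, ".": 0.20},  # unconfident: needs sensitive context
        {".": 0.99, "the": 0.01},
    ]
    return canned[min(len(prefix), len(canned) - 1)]

def ensemble_votes(prefix):
    # Hypothetical: one vote per disjoint sensitive document; in practice each
    # vote would come from the LLM conditioned on that document alone.
    return {"diabetes": 8, "asthma": 2}

def noisy_argmax(votes, epsilon):
    # Report-noisy-max: add Laplace(2/epsilon) noise to each vote count.
    # A Laplace sample is the difference of two i.i.d. exponential samples.
    lam = epsilon / 2.0
    noisy = {t: v + random.expovariate(lam) - random.expovariate(lam)
             for t, v in votes.items()}
    return max(noisy, key=noisy.get)

def generate(max_tokens=6, eps_per_token=1.0, threshold=0.8):
    out, budget_spent = [], 0.0
    for _ in range(max_tokens):
        probs = public_probs(out)
        top, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            out.append(top)                 # public token: zero privacy cost
        else:
            out.append(noisy_argmax(ensemble_votes(out), eps_per_token))
            budget_spent += eps_per_token   # only these steps consume budget
        if out[-1] == ".":
            break
    return out, budget_spent

tokens, spent = generate()
print(tokens, spent)
```

In this toy run only one of the five emitted tokens touches the sensitive documents, so the total budget spent is one per-token epsilon rather than five, illustrating why routing most tokens through the public model keeps the overall privacy cost moderate.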