In the era of large language models (LLMs), efficient and accurate data retrieval has become increasingly crucial for using domain-specific or private data in retrieval-augmented generation (RAG). Neural graph databases (NGDBs) have emerged as a powerful paradigm that combines the strengths of graph databases (GDBs) and neural networks to enable efficient storage, retrieval, and analysis of graph-structured data, and they can be adaptively trained with LLMs. The use of neural embedding storage and complex neural logical query answering (CQA) gives NGDBs their generalization ability: when the graph is incomplete, they can extract latent patterns and representations to fill gaps in the graph structure, revealing hidden relationships and enabling accurate query answering. Nevertheless, this capability comes with an inherent trade-off, as it introduces additional privacy risks to domain-specific or private databases. A malicious attacker can infer sensitive information from well-designed queries; for example, from the answer sets of the queries "where did Turing Award winners born before 1950 live?" and "where did Turing Award winners born after 1940 live?", the living place of Turing Award winner Hinton may be exposed, even though that fact was deleted in the training stage due to privacy concerns. In this work, we propose a privacy-preserved neural graph database (P-NGDB) framework to alleviate the risk of privacy leakage in NGDBs. We introduce adversarial training techniques in the training stage that force NGDBs to generate indistinguishable answers when queried with private information, increasing the difficulty of inferring sensitive information through combinations of multiple innocuous queries.
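The inference attack described above can be sketched with plain set logic. The snippet below is a minimal, hypothetical illustration (the answer sets are illustrative placeholders, not real data): even when the direct fact linking Hinton to his living place has been deleted, intersecting the answer sets of two innocuous range queries can pin down the private value if he is the only winner in the overlapping birth-year window.

```python
# Hypothetical toy example of leakage via query combination.
# A CQA-capable NGDB generalizes over latent patterns, so it may still
# answer composite queries that jointly expose a deleted private fact.

# Answer set of: "living places of Turing Award winners born before 1950"
lived_born_before_1950 = {"Toronto", "Princeton", "Cambridge"}

# Answer set of: "living places of Turing Award winners born after 1940"
lived_born_after_1940 = {"Toronto", "Zurich"}

# If Hinton is the only winner born in the overlap (1940, 1950), the
# intersection of two innocuous answer sets exposes his living place,
# even though the triple (Hinton, lives_in, Toronto) was deleted.
leaked = lived_born_before_1950 & lived_born_after_1940
print(leaked)  # {'Toronto'}
```

This is exactly the attack surface P-NGDB targets: adversarial training pushes the answers to such privacy-touching queries toward indistinguishability, so the intersection no longer isolates a single sensitive value.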