In the era of large language models (LLMs), efficient and accurate data retrieval has become increasingly crucial for using domain-specific or private data in retrieval-augmented generation (RAG). Neural graph databases (NGDBs) have emerged as a powerful paradigm that combines the strengths of graph databases (GDBs) and neural networks to enable efficient storage, retrieval, and analysis of graph-structured data, and they can be adaptively trained with LLMs. Neural embedding storage and complex neural logical query answering (CQA) give NGDBs the ability to generalize: when the graph is incomplete, an NGDB can extract latent patterns and representations to fill gaps in the graph structure, revealing hidden relationships and enabling accurate query answering. This capability comes with an inherent trade-off, however, because it introduces additional privacy risks for domain-specific or private databases. Malicious attackers can infer sensitive information in the database using well-designed queries. For example, the answer set of the query "where did Turing Award winners born after 1940 and before 1950 live?" probably exposes the places where Turing Award winner Hinton has lived, even though those facts may have been deleted at the training stage due to privacy concerns. In this work, we propose a privacy-preserved neural graph database (P-NGDB) framework to alleviate the risk of privacy leakage in NGDBs. We introduce adversarial training techniques in the training stage to force NGDBs to generate indistinguishable answers when queried with private information, making it harder to infer sensitive information by combining multiple innocuous queries.
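The leakage scenario described in the abstract can be illustrated with a toy sketch. This is not the P-NGDB method or a neural model: it is a purely symbolic stand-in, and every name in it (`BIRTH_YEAR`, `LIVED_IN`, `PRIVATE_ENTITIES`, `direct_lookup`, `lived_in_by_birth_range`, and the `laureate_*` / `city_*` identifiers) is hypothetical. The store here still holds the sensitive facts, which stands in for a model that has re-inferred them after they were deleted. The sketch only shows why blocking a direct query is not enough when a composite query has a single matching entity.

```python
from typing import Dict, Set

# Hypothetical toy data; none of these entities or facts are real.
BIRTH_YEAR: Dict[str, int] = {
    "laureate_a": 1947,
    "laureate_b": 1936,
    "laureate_c": 1960,
}
LIVED_IN: Dict[str, Set[str]] = {
    "laureate_a": {"city_x", "city_y"},
    "laureate_b": {"city_z"},
    "laureate_c": {"city_x"},
}
# Entities whose living places are treated as private (assumption).
PRIVATE_ENTITIES: Set[str] = {"laureate_a"}


def direct_lookup(person: str) -> Set[str]:
    """Direct query: refuse to answer for private entities."""
    if person in PRIVATE_ENTITIES:
        return set()
    return set(LIVED_IN.get(person, set()))


def lived_in_by_birth_range(after: int, before: int) -> Set[str]:
    """Composite query: places lived in by anyone born strictly
    between `after` and `before`. No privacy check is applied here,
    which is the weakness the sketch demonstrates."""
    matches = [p for p, year in BIRTH_YEAR.items() if after < year < before]
    places: Set[str] = set()
    for person in matches:
        places |= LIVED_IN.get(person, set())
    return places


# The direct query is blocked, but the composite query matches only
# laureate_a, so its answer set reveals the protected facts.
blocked = direct_lookup("laureate_a")
leaked = lived_in_by_birth_range(1940, 1950)
```

In this toy setting `blocked` is empty while `leaked` contains exactly the protected living places. A defense in the spirit of the abstract would aim to make the composite answer uninformative about the private entity rather than only filtering direct lookups.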