In the era of large language models (LLMs), efficient and accurate data retrieval has become increasingly crucial for using domain-specific or private data in retrieval-augmented generation (RAG). Neural graph databases (NGDBs) have emerged as a powerful paradigm that combines the strengths of graph databases (GDBs) and neural networks to enable efficient storage, retrieval, and analysis of graph-structured data, and that can be adaptively trained with LLMs. Neural embedding storage and complex neural logical query answering (CQA) give NGDBs the ability to generalize. When the graph is incomplete, NGDBs can extract latent patterns and representations to fill gaps in the graph structure, revealing hidden relationships and enabling accurate query answering. Nevertheless, this capability comes with an inherent trade-off: it introduces additional privacy risks to domain-specific or private databases. Malicious attackers can infer sensitive information in the database using well-designed queries. For example, from the answer set of the query "where did Turing Award winners born after 1940 and before 1950 live?", the places where Turing Award winner Hinton has lived are likely to be exposed, even though they may have been deleted in the training stage because of privacy concerns. In this work, we propose a privacy-preserved neural graph database (P-NGDB) framework to alleviate the risks of privacy leakage in NGDBs. We introduce adversarial training techniques in the training stage to force NGDBs to generate indistinguishable answers when queried with private information, making it harder to infer sensitive information by combining multiple innocuous queries.
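The leakage scenario in the abstract can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the P-NGDB method: every entity name other than Hinton, the place labels, and the `predicted_lived_in` table (a stand-in for links an NGDB might infer after the stored facts were deleted) are hypothetical, as are the helper functions `matching_people` and `answer_living_places`.

```python
from typing import Dict, Set

# Toy data only. "winner_B", "winner_C", "person_D" and all city labels are
# hypothetical placeholders, not real records.
turing_award_winners: Set[str] = {"Hinton", "winner_B", "winner_C"}
birth_year: Dict[str, int] = {
    "Hinton": 1947,
    "winner_B": 1960,
    "winner_C": 1936,
    "person_D": 1945,  # born in the range, but not an award winner
}

# Assumption: stands in for an NGDB's generalization, i.e. "lived in" links the
# model can still predict even though they were removed from the stored graph.
predicted_lived_in: Dict[str, Set[str]] = {
    "Hinton": {"city_X", "city_Y"},
    "winner_B": {"city_Z"},
    "winner_C": {"city_X"},
    "person_D": {"city_W"},
}


def matching_people(after: int, before: int) -> Set[str]:
    """Award winners born strictly after `after` and strictly before `before`."""
    return {p for p in turing_award_winners if after < birth_year[p] < before}


def answer_living_places(after: int, before: int) -> Set[str]:
    """Answer set of: where did award winners born in (after, before) live?"""
    places: Set[str] = set()
    for person in matching_people(after, before):
        places |= predicted_lived_in.get(person, set())
    return places


# A broad query mixes the places of several people together.
broad = answer_living_places(1900, 2000)
# A narrow, innocuous-looking query matches a single person, so its answer
# set is exactly that person's (supposedly deleted) living places.
narrow = answer_living_places(1940, 1950)
```

In this toy setting the narrow query never names Hinton, yet its answer set coincides with his entry in `predicted_lived_in`, which is the kind of inference the abstract describes.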