The ability to generate SPARQL queries from natural language questions is crucial for ensuring efficient and accurate retrieval of structured data from knowledge graphs (KGs). While large language models (LLMs) have been widely adopted for SPARQL query generation, they are often susceptible to hallucinations and out-of-distribution errors when generating KG elements, such as Uniform Resource Identifiers (URIs), based on opaque internal parametric knowledge. We propose PGMR (Post-Generation Memory Retrieval), a modular framework in which the LLM produces an intermediate query containing natural language placeholders for URIs, and a non-parametric memory module is subsequently employed to retrieve and resolve the correct KG URIs. PGMR significantly enhances query correctness (SQM) across various LLMs, datasets, and distribution shifts, while achieving near-complete suppression of URI hallucinations. Critically, we demonstrate PGMR's superior safety and robustness: a retrieval confidence threshold enables PGMR to refuse to answer queries that lack support, and the retriever proves highly resilient to memory noise, maintaining strong performance even when the non-parametric memory is scaled to nine times its original size with irrelevant, distracting entities.
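The post-generation retrieval step can be illustrated with a minimal sketch. All names here are hypothetical: the `[entity: ...]` / `[relation: ...]` placeholder syntax, the toy label-to-URI memory, and the `difflib`-based similarity are stand-ins for whatever placeholder format and retriever PGMR actually uses. The sketch shows the two ideas from the abstract: resolving placeholders against a non-parametric memory, and refusing when no match clears a confidence threshold.

```python
import re
from difflib import SequenceMatcher

# Toy non-parametric memory mapping natural-language labels to KG URIs.
# (Hypothetical entries; a real memory would index the full knowledge graph.)
MEMORY = {
    "Barack Obama": "http://dbpedia.org/resource/Barack_Obama",
    "spouse": "http://dbpedia.org/ontology/spouse",
    "Michelle Obama": "http://dbpedia.org/resource/Michelle_Obama",
}

# Assumed placeholder syntax for the LLM's intermediate query.
PLACEHOLDER = re.compile(r"\[(?:entity|relation):\s*([^\]]+)\]")


def retrieve(label, threshold=0.7):
    """Return (uri, score) for the best-matching memory entry,
    or (None, score) when no entry clears the confidence threshold."""
    best_uri, best_score = None, 0.0
    for key, uri in MEMORY.items():
        score = SequenceMatcher(None, label.lower(), key.lower()).ratio()
        if score > best_score:
            best_uri, best_score = uri, score
    if best_score < threshold:
        return None, best_score
    return best_uri, best_score


def resolve(intermediate_query, threshold=0.7):
    """Replace every placeholder with a retrieved URI; raise (i.e. refuse
    to answer) if any placeholder lacks a confident match."""
    def _sub(match):
        label = match.group(1).strip()
        uri, _ = retrieve(label, threshold)
        if uri is None:
            raise LookupError(f"no confident match for {label!r}")
        return f"<{uri}>"
    return PLACEHOLDER.sub(_sub, intermediate_query)


# Intermediate query as an LLM might emit it, with placeholders for URIs.
q = "SELECT ?x WHERE { [entity: Barack Obama] [relation: spouse] ?x }"
print(resolve(q))
```

Because URIs only ever come from the memory lookup, the resolved query cannot contain a hallucinated URI, and a low retrieval score turns into an explicit refusal rather than a silent guess.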