Translating natural language questions into SPARQL queries makes it possible to query knowledge bases for factual, up-to-date answers. However, existing datasets for this task are predominantly template-based, so models learn superficial mappings between question templates and query templates rather than developing true generalization capabilities. As a result, they struggle with naturally phrased, template-free questions. This paper introduces FRASE (FRAme-based Semantic Enhancement), a novel approach that leverages Frame Semantic Role Labeling (FSRL) to address this limitation. We also present LC-QuAD 3.0, a new dataset derived from LC-QuAD 2.0, in which each question is enriched using FRASE through frame detection and the mapping of frame elements to their arguments. We evaluate the impact of this approach through extensive experiments on recent large language models (LLMs) under different fine-tuning configurations. Our results demonstrate that integrating frame-based structured representations consistently improves SPARQL generation performance, particularly in two challenging generalization scenarios: when test questions feature unseen templates (unknown template splits) and when they are all naturally phrased (reformulated questions).
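To make the idea of frame-based enrichment concrete, the following is a minimal sketch of how a question might be paired with a detected frame and its frame-element-to-argument mapping before being passed to a model. It is not the FRASE implementation: the `FrameAnnotation` class, the `enrich_question` helper, the `[FRAMES]` separator, and the frame and element labels in the example are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FrameAnnotation:
    """One detected frame plus its frame elements mapped to question spans.

    Hypothetical container; the actual LC-QuAD 3.0 schema may differ.
    """
    frame: str
    elements: dict[str, str] = field(default_factory=dict)  # frame element -> argument text


def enrich_question(question: str, frames: list[FrameAnnotation]) -> str:
    """Append a linearized description of the frames to the question.

    The "[FRAMES]" marker and the "Frame(role=arg; ...)" layout are
    assumptions made for this sketch, not a format taken from the paper.
    """
    parts = []
    for ann in frames:
        args = "; ".join(f"{role}={arg}" for role, arg in ann.elements.items())
        parts.append(f"{ann.frame}({args})")
    if not parts:
        # No frame detected: leave the question unchanged.
        return question
    return f"{question} [FRAMES] " + " | ".join(parts)


# Illustrative labels only; not checked against any frame inventory.
example = enrich_question(
    "Who directed Inception?",
    [FrameAnnotation("Directing", {"Director": "Who", "Work": "Inception"})],
)
print(example)
```

In a setup like this, the enriched string would replace the raw question as the model input during fine-tuning, so that the structural cue is available regardless of how the question is phrased.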