In this work, we focus on the task of generating SPARQL queries from natural language questions, which can then be executed on Knowledge Graphs (KGs). We assume that gold entities and relations have been provided, and the remaining task is to arrange them in the right order, along with SPARQL vocabulary and input tokens, to produce the correct SPARQL query. Pre-trained Language Models (PLMs) have not been explored in depth on this task so far, so we experiment with BART, T5 and PGNs (Pointer Generator Networks) with BERT embeddings, looking for new baselines in the PLM era for this task, on the DBpedia and Wikidata KGs. We show that T5 requires special input tokenisation, but produces state-of-the-art performance on the LC-QuAD 1.0 and LC-QuAD 2.0 datasets, and outperforms task-specific models from previous works. Moreover, these methods enable semantic parsing for questions where a part of the input needs to be copied to the output query, thus opening up a new paradigm in KG semantic parsing.
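To make the task setup concrete, the following is a minimal Python sketch of how a question and its gold entities and relations might be linearised into one source string, and how each token of a target query can be attributed to one of the three sources named above: SPARQL vocabulary, gold KG items, or tokens copied from the question. The separator string, the helper names `build_input` and `token_sources`, the small vocabulary set, and the example question and query are all assumptions made for illustration; they are not the input format or data used in this work.

```python
# Illustrative only: the vocabulary, separator and example are assumptions.
SPARQL_VOCAB = {
    "SELECT", "DISTINCT", "WHERE", "FILTER", "CONTAINS",
    "{", "}", "(", ")", ".", ",", "?x", "?l",
}


def build_input(question, entities, relations, sep=" [SEP] "):
    """Linearise the question and the gold KG items into one source string."""
    return sep.join([question, " ".join(entities), " ".join(relations)])


def token_sources(query_tokens, question, entities, relations):
    """Label each target token by the source a model would draw it from."""
    question_words = set(question.lower().split())
    labels = []
    for tok in query_tokens:
        if tok in SPARQL_VOCAB:
            labels.append("vocab")    # fixed SPARQL keyword, bracket or variable
        elif tok in entities or tok in relations:
            labels.append("gold")     # provided gold entity or relation
        elif tok.strip("'\"").lower() in question_words:
            labels.append("copy")     # literal copied from the question text
        else:
            labels.append("unknown")
    return labels


# Hypothetical example with DBpedia-style prefixed names.
question = "Which films directed by Christopher Nolan have dark in the title ?"
entities = ["dbr:Christopher_Nolan"]
relations = ["dbo:director", "rdfs:label"]
query_tokens = [
    "SELECT", "DISTINCT", "?x", "WHERE", "{",
    "?x", "dbo:director", "dbr:Christopher_Nolan", ".",
    "?x", "rdfs:label", "?l", ".",
    "FILTER", "(", "CONTAINS", "(", "?l", ",", "'dark'", ")", ")",
    "}",
]

source = build_input(question, entities, relations)
labels = token_sources(query_tokens, question, entities, relations)
```

In this sketch the literal `'dark'` is the only token labelled `copy`: it appears neither in the SPARQL vocabulary nor among the gold items, so a model can only produce it by copying from the question, which is the case the last sentence of the abstract refers to.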