We present the Wikidata Query Logs (WDQL) dataset, consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata dataset of similar format, without relying on template-generated queries. Instead, we construct it from real-world SPARQL queries sent to the Wikidata Query Service and generate natural-language questions for them. Since these logged queries are anonymized, and therefore often do not produce results, significant effort is needed to convert them back into meaningful SPARQL queries. To this end, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.