Question answering (QA) over RDF data such as knowledge graphs has advanced greatly, with a number of strong systems providing crisp answers to natural language questions or telegraphic queries. Some of these systems incorporate textual sources as additional evidence for the answering process, but cannot compute answers that are present in text alone. Conversely, the IR and NLP communities have addressed QA over text, but such systems barely utilize semantic data and knowledge. This paper presents a method for answering complex questions that can seamlessly operate over a mixture of RDF datasets and text corpora, or over either source individually, in a unified framework. Our method, called UNIQORN, builds a context graph on the fly by retrieving question-relevant evidence from the RDF data and/or a text corpus using fine-tuned BERT models. The resulting graph typically contains all question-relevant evidence but also a lot of noise. UNIQORN copes with this input by means of a graph algorithm for Group Steiner Trees that identifies the best answer candidates in the context graph. Experimental results on several benchmarks of complex questions with multiple entities and relations show that UNIQORN significantly outperforms state-of-the-art methods for heterogeneous QA -- in a full training mode as well as in zero-shot settings. The graph-based methodology provides user-interpretable evidence for the complete answering process.
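To make the Group Steiner Tree idea concrete, the following is a minimal, illustrative sketch (not the UNIQORN implementation): each question concept yields a *group* of matching nodes in the context graph, and a cheapest tree touching at least one node per group is sought; non-terminal nodes of that tree become answer candidates. The graph encoding, node names, and the brute-force strategy here are all assumptions for illustration only, since the exhaustive search is exponential in the number of groups.

```python
from collections import deque
from itertools import product


def shortest_path(adj, src, dst):
    """BFS shortest path in an unweighted, undirected graph given as an
    adjacency dict {node: [neighbors]}; returns a node list or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None


def group_steiner_tree(adj, groups):
    """Toy exhaustive Group Steiner solver: try every choice of one
    terminal per group, join the chosen terminals with shortest paths,
    and keep the cheapest edge set (edges as frozensets, so direction
    does not matter). Exponential in the number of groups; a practical
    system would use an approximation algorithm instead."""
    best, best_cost = None, float("inf")
    for terminals in product(*groups):
        edges = set()
        root = terminals[0]
        for t in terminals[1:]:
            path = shortest_path(adj, root, t)
            if path is None:
                break  # this combination cannot be connected
            edges |= {frozenset(e) for e in zip(path, path[1:])}
        else:
            if len(edges) < best_cost:
                best, best_cost = edges, len(edges)
    return best
```

In a toy context graph where question nodes `q1` and `q2` (each its own group) are both linked to an entity `x`, the cheapest tree is the two edges through `x`, and `x` surfaces as the answer candidate.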