Question answering over RDF data such as knowledge graphs has advanced greatly, with a number of good systems providing crisp answers to natural language questions or telegraphic queries. Some of these systems incorporate textual sources as additional evidence for the answering process, but cannot compute answers that are present in text alone. Conversely, the IR and NLP communities have addressed QA over text, but such systems barely utilize semantic data and knowledge. This paper presents a method for complex questions that can seamlessly operate over a mixture of RDF datasets and text corpora, or over either source individually, in a unified framework. Our method, called UNIQORN, builds a context graph on the fly by retrieving question-relevant evidence from the RDF data and/or a text corpus, using fine-tuned BERT models. The resulting graph typically contains all question-relevant evidence but also considerable noise. UNIQORN copes with this input by applying a graph algorithm for Group Steiner Trees that identifies the best answer candidates in the context graph. Experimental results on several benchmarks of complex questions with multiple entities and relations show that UNIQORN significantly outperforms state-of-the-art methods for heterogeneous QA, both in a full training mode and in zero-shot settings. The graph-based methodology provides user-interpretable evidence for the complete answering process.
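To make the Group Steiner Tree step more concrete, the following is a minimal Python sketch under stated assumptions. It is not UNIQORN's implementation: the toy context graph, the edge weights, the function names (`add_edge`, `shortest_paths`, `group_steiner_heuristic`, `answer_candidates`), and the rule that non-terminal tree nodes serve as answer candidates are all illustrative assumptions. The search is a simple shortest-path heuristic (try every node as a root and connect it to the nearest terminal of each group), not an exact Group Steiner Tree algorithm.

```python
import heapq


def add_edge(graph, u, v, weight):
    """Add an undirected weighted edge to an adjacency-dict graph."""
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight


def shortest_paths(graph, source):
    """Dijkstra: return (distance, predecessor) maps for reachable nodes."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def group_steiner_heuristic(graph, groups):
    """Heuristic (assumption, not exact): for each root, join the nearest
    terminal of every group via shortest paths; keep the cheapest tree.
    Returns (cost, root, edges) or None if no root reaches all groups."""
    best = None
    for root in sorted(graph):
        dist, prev = shortest_paths(graph, root)
        edges = set()
        covered = True
        for group in groups:
            reachable = [t for t in group if t in dist]
            if not reachable:
                covered = False
                break
            node = min(reachable, key=lambda t: (dist[t], t))
            while node != root:  # walk back along the shortest path
                parent = prev[node]
                edges.add(tuple(sorted((node, parent))))
                node = parent
        if not covered:
            continue
        cost = sum(graph[a][b] for a, b in edges)
        if best is None or cost < best[0]:
            best = (cost, root, edges)
    return best


def answer_candidates(edges, groups):
    """Assumed rule: tree nodes that are not terminals are candidates."""
    terminals = set().union(*groups)
    nodes = {n for edge in edges for n in edge}
    return sorted(nodes - terminals)


# Hypothetical context graph for "Which film by Nolan stars DiCaprio?"
graph = {}
add_edge(graph, "Christopher Nolan", "Inception", 1.0)
add_edge(graph, "Inception", "Leonardo DiCaprio", 1.0)
add_edge(graph, "Leonardo DiCaprio", "Titanic", 1.0)
add_edge(graph, "Titanic", "James Cameron", 1.0)
add_edge(graph, "Jonathan Nolan", "Westworld", 1.0)  # noise
add_edge(graph, "Westworld", "HBO", 1.0)             # noise

# One terminal group per question keyword; "Nolan" matches two nodes.
groups = [
    {"Christopher Nolan", "Jonathan Nolan"},
    {"Leonardo DiCaprio"},
]

cost, root, edges = group_steiner_heuristic(graph, groups)
print(cost, answer_candidates(edges, groups))
```

In this toy graph the cheapest tree links the two keyword groups through a single intermediate node, and the noisy component around the second "Nolan" match is ignored because it cannot reach every group. The tree edges themselves double as a human-readable trace of how a candidate was reached.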