In this paper, we examine the impact of lexicalization on Question Answering over Linked Data (QALD). It is well known that one of the key challenges in interpreting natural language questions as SPARQL queries lies in bridging the lexical gap, that is, mapping the words in the question to the correct vocabulary elements. We argue that lexicalization, that is, explicit knowledge about the potential interpretations of a word with respect to the given vocabulary, significantly eases the task and increases the performance of QA systems. To this end, we present a compositional QA system that leverages explicit lexical knowledge to infer the meaning of a question in terms of a SPARQL query. We show that such a system, given lexical knowledge, performs well beyond current QA systems, achieving up to a $35.8\%$ increase in the micro $F_1$ score compared to the best QA system on QALD-9. This shows the importance and potential of including explicit lexical knowledge. In contrast, we show that LLMs have limited abilities to exploit lexical knowledge, with only marginal improvements compared to a version without lexical knowledge. This indicates that LLMs lack the ability to compositionally interpret a question on the basis of the meaning of its parts, a key feature of compositional approaches. Taken together, our work opens new avenues for QALD research, emphasizing the importance of lexicalization and compositionality.
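To make the notion of bridging the lexical gap concrete, the following is a minimal sketch of how explicit lexical knowledge could map words to vocabulary elements and how those elements could then be composed into a SPARQL query. It is not the system described in this paper: the `LEXICON` table, the `interpret` function, the one-property-one-entity toy grammar, and the choice of `dbo:mayor` and `dbr:Berlin` as example vocabulary elements are all assumptions made for illustration, and the usual prefix declarations are assumed to be supplied elsewhere.

```python
# Hypothetical toy lexicon: surface form -> (kind, vocabulary element).
# The entries are illustrative assumptions, not taken from the paper.
LEXICON = {
    "mayor": ("property", "dbo:mayor"),
    "berlin": ("entity", "dbr:Berlin"),
}


def interpret(question: str) -> str:
    """Compose a SPARQL query from the lexicon entries found in the question."""
    # Normalize tokens so that "Berlin?" matches the lexicon key "berlin".
    tokens = [t.strip("?.,").lower() for t in question.split()]
    entries = [LEXICON[t] for t in tokens if t in LEXICON]

    properties = [iri for kind, iri in entries if kind == "property"]
    entities = [iri for kind, iri in entries if kind == "entity"]

    # The toy grammar only covers questions with one property and one entity.
    if len(properties) != 1 or len(entities) != 1:
        raise ValueError("toy grammar covers exactly one property and one entity")

    # Combine the meanings of the parts into a single triple pattern.
    return f"SELECT ?x WHERE {{ {entities[0]} {properties[0]} ?x }}"


query = interpret("Who is the mayor of Berlin?")
print(query)
```

Without the lexicon entry for "mayor", the sketch has no way to select the property, which is the lexical gap in miniature; a real system would also need to handle ambiguity, multi-word expressions, and richer syntactic composition.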