Large Language Models (LLMs) have demonstrated unprecedented prowess across natural language processing tasks in a wide range of application domains. Recent studies show that LLMs can be leveraged to perform lexical semantic tasks, such as Knowledge Base Completion (KBC) or Ontology Learning (OL). However, it has not been conclusively verified whether their success is due to their ability to reason over unstructured or semi-structured data, or to their effective learning of linguistic patterns and senses alone. This unresolved question is particularly crucial when dealing with domain-specific data, where lexical senses and their meanings can differ completely from what an LLM has learned during its training stage. This paper investigates the following question: Do LLMs really adapt to domains and remain consistent in the extraction of structured knowledge, or do they only learn lexical senses instead of reasoning? To answer this question, we devise a controlled experimental setup that uses WordNet to synthesize parallel corpora with English and gibberish terms. We examine the differences in the outputs of LLMs for each corpus in two OL tasks: relation extraction and taxonomy discovery. Empirical results show that, while adapting to the gibberish corpora, off-the-shelf LLMs do not consistently reason over semantic relationships between concepts, and instead leverage senses and their frame. However, fine-tuning improves the performance of LLMs on lexical semantic tasks even when the domain-specific terms are arbitrary and unseen during pre-training, hinting at the applicability of pre-trained LLMs for OL.
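The idea of a parallel English/gibberish corpus can be illustrated with a minimal sketch. It is not the paper's actual pipeline: the hypernym pairs below are a small hand-written stand-in for relations that would be drawn from WordNet, and the names `HYPERNYM_PAIRS`, `make_gibberish`, `build_term_map`, and `build_parallel_corpora`, as well as the sentence template, are hypothetical. The point it shows is that every English term is replaced by one fixed nonsense token, so the relational structure is preserved while the familiar lexical senses are removed.

```python
import random

# Hypothetical toy stand-in for (hyponym, hypernym) pairs taken from WordNet.
HYPERNYM_PAIRS = [
    ("dog", "canine"),
    ("canine", "carnivore"),
    ("cat", "feline"),
    ("feline", "carnivore"),
]


def make_gibberish(rng, length=6):
    """Return a nonsense token that alternates consonants and vowels."""
    consonants = "bcdfghjklmnprstvz"
    vowels = "aeiou"
    return "".join(
        rng.choice(consonants if i % 2 == 0 else vowels) for i in range(length)
    )


def build_term_map(pairs, seed=0):
    """Assign each distinct English term one unique gibberish token."""
    rng = random.Random(seed)  # seeded so the mapping is reproducible
    terms = sorted({term for pair in pairs for term in pair})
    used = set(terms)  # never reuse a real term or an earlier token
    mapping = {}
    for term in terms:
        candidate = make_gibberish(rng)
        while candidate in used:
            candidate = make_gibberish(rng)
        used.add(candidate)
        mapping[term] = candidate
    return mapping


def build_parallel_corpora(pairs, mapping, template="{hypo} is a kind of {hyper}."):
    """Render each pair twice: with English terms and with gibberish terms."""
    english = [template.format(hypo=h, hyper=g) for h, g in pairs]
    gibberish = [
        template.format(hypo=mapping[h], hyper=mapping[g]) for h, g in pairs
    ]
    return english, gibberish


if __name__ == "__main__":
    term_map = build_term_map(HYPERNYM_PAIRS, seed=0)
    en, gib = build_parallel_corpora(HYPERNYM_PAIRS, term_map)
    for en_line, gib_line in zip(en, gib):
        print(en_line, "|", gib_line)
```

Because the term mapping is one-to-one, a model that reasons over the stated relations should in principle produce the same taxonomy for both corpora up to renaming, whereas a model that relies on memorized senses has nothing to fall back on in the gibberish version.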