Ontologies of research topics are crucial for structuring scientific knowledge, enabling scientists to navigate vast amounts of research, and forming the backbone of intelligent systems such as search engines and recommender systems. However, manually creating these ontologies is expensive and slow, and often yields outdated, overly general representations. As a solution, researchers have been investigating ways to automate or semi-automate the process of generating these ontologies. This paper offers a comprehensive analysis of the ability of large language models (LLMs) to identify semantic relationships between research topics, a critical step in the development of such ontologies. To this end, we developed a gold standard based on the IEEE Thesaurus for evaluating the task of identifying four types of relationships between pairs of topics: broader, narrower, same-as, and other. Our study evaluates the performance of seventeen LLMs, which differ in scale, accessibility (open vs. proprietary), and model type (full vs. quantised), while also assessing four zero-shot reasoning strategies. Several models achieved outstanding results, including Mixtral-8x7B, Dolphin-Mistral-7B, and Claude 3 Sonnet, with F1-scores of 0.847, 0.920, and 0.967, respectively. Furthermore, our findings demonstrate that smaller, quantised models, when optimised through prompt engineering, can deliver performance comparable to that of much larger proprietary models while requiring significantly fewer computational resources.