Ontologies and taxonomies of research fields are critical for managing and organising scientific knowledge, as they facilitate efficient classification, dissemination, and retrieval of information. However, creating and maintaining such ontologies is expensive and time-consuming, and usually requires the coordinated effort of multiple domain experts. Consequently, ontologies in this space often exhibit uneven coverage across disciplines, limited cross-discipline connectivity, and infrequent update cycles. In this study, we investigate the capability of several large language models (LLMs) to identify semantic relationships among research topics within three academic disciplines: biomedicine, physics, and engineering. The models were evaluated under three distinct conditions: zero-shot prompting, chain-of-thought prompting, and fine-tuning on existing ontologies. Additionally, we assessed the cross-discipline transferability of fine-tuned models by measuring their performance when trained on one discipline and subsequently applied to a different one. To support this analysis, we introduce PEM-Rel-8K, a novel dataset consisting of over 8,000 relationships extracted from the most widely adopted taxonomies in the three disciplines considered in this study: MeSH, PhySH, and IEEE. Our experiments demonstrate that fine-tuning LLMs on PEM-Rel-8K yields excellent performance across all disciplines.