Automated knowledge curation for biomedical ontologies is key to ensuring that they remain comprehensive, high-quality, and up-to-date. In the era of foundation language models, this study compares and analyzes three NLP paradigms for curation tasks: in-context learning (ICL), fine-tuning (FT), and supervised learning (ML). Using the Chemical Entities of Biological Interest (ChEBI) database as a model ontology, three curation tasks were devised. For ICL, three prompting strategies were employed with GPT-4, GPT-3.5, and BioGPT. PubMedBERT was chosen for the FT paradigm. For ML, six embedding models were used to train Random Forest and Long Short-Term Memory models. Five setups were designed to assess ML and FT model performance across different data availability scenarios. Datasets for the curation tasks included: task 1 (620,386), task 2 (611,430), and task 3 (617,381), each with a 50:50 ratio of positives to negatives. Among ICL models, GPT-4 achieved the best accuracy scores of 0.916, 0.766, and 0.874 for tasks 1-3, respectively. In a direct comparison, ML (trained on ~260,000 triples) outperformed ICL in accuracy across all tasks (accuracy differences: +0.11, +0.22, and +0.17). Fine-tuned PubMedBERT performed similarly to the leading ML models in tasks 1 and 2 (F1 differences: -0.014 and +0.002) but worse in task 3 (-0.048). Simulations revealed that performance declined in both ML and FT models with smaller and more imbalanced training data, conditions under which ICL (particularly GPT-4) excelled in tasks 1 and 3: with fewer than 6,000 triples, GPT-4 surpassed ML/FT in those tasks. ICL underperformed ML/FT in task 2. ICL-augmented foundation models can be good assistants for knowledge curation when prompted correctly, but they do not make the ML and FT paradigms obsolete. The latter two require task-specific data to beat ICL. In such cases, ML relies on small pretrained embeddings, minimizing computational demands.