CoRTEx: Contrastive Learning for Representing Terms via Explanations with Applications on Constructing Biomedical Knowledge Graphs

Objective: Biomedical knowledge graphs play a pivotal role in many areas of biomedical research. Term clustering, which aims to identify synonymous terms, is a crucial step in constructing these knowledge graphs. Because they lack sufficient knowledge, previous contrastive learning models trained on Unified Medical Language System (UMLS) synonyms struggle to cluster difficult terms and do not generalize well beyond UMLS terms. In this work, we leverage the world knowledge of Large Language Models (LLMs) and propose Contrastive Learning for Representing Terms via Explanations (CoRTEx) to enhance term representation and significantly improve term clustering.

Materials and Methods: Model training involves generating explanations for a cleaned subset of UMLS terms using ChatGPT. We employ contrastive learning that considers term and explanation embeddings simultaneously, and we progressively introduce hard negative samples. Additionally, we design a ChatGPT-assisted BIRCH algorithm for efficient clustering of a new ontology.

Results: We established a clustering test set and a hard negative test set, on both of which our model consistently achieves the highest F1 score. With CoRTEx embeddings and the modified BIRCH algorithm, we grouped 35,580,932 terms from the Biomedical Informatics Ontology System (BIOS) into 22,104,559 clusters using O(N) queries to ChatGPT. Case studies highlight the model's efficacy in handling challenging samples, aided by information from explanations.

Conclusion: By aligning terms with their explanations, CoRTEx demonstrates higher accuracy than benchmark models and robustness beyond its training set, and it is suitable for clustering terms in large-scale biomedical ontologies.
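To make the idea of aligning terms with their explanations concrete, the following is a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss between term embeddings and explanation embeddings. It is an illustration only: the abstract does not specify the actual loss, encoder, temperature, or hard negative schedule used by CoRTEx, so the function names (`cosine`, `info_nce`), the temperature value, and the toy two-dimensional vectors are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(term_vecs, expl_vecs, temperature=0.05):
    """Mean in-batch InfoNCE loss (hypothetical helper, not the paper's code).

    For term i, explanation i is treated as the positive and every other
    explanation in the batch is treated as a negative.
    """
    n = len(term_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(term_vecs[i], e) / temperature for e in expl_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings (assumed values): each term points roughly in the same
# direction as its own explanation.
terms = [[1.0, 0.0], [0.0, 1.0]]
explanations = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(terms, explanations)
# Reversing the explanations pairs each term with the wrong explanation.
mismatched_loss = info_nce(terms, explanations[::-1])
print(aligned_loss, mismatched_loss)
```

In this toy setting the loss is small when each term is paired with its own explanation and large when the pairs are swapped, which is the signal a contrastive objective uses to pull a term toward its explanation. Hard negatives, as mentioned in the abstract, would correspond to adding deliberately confusable explanations or terms to the set of negatives; how that is done in CoRTEx is not described here.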
