Electronic health records (EHRs) contain narrative notes that provide extensive details on the medical condition and management of patients. Natural language processing (NLP) of clinical notes can use observed frequencies of clinical terms as predictive features for downstream applications such as clinical decision making and patient trajectory prediction. However, due to the vast number of highly similar and related clinical concepts, a more effective modeling strategy is to represent clinical terms as semantic embeddings via representation learning and use the low-dimensional embeddings as feature vectors for predictive modeling. To achieve efficient representation, fine-tuning pretrained language models with biomedical knowledge graphs may generate better embeddings for biomedical terms than those from standard language models alone. These embeddings can effectively discriminate synonymous pairs from those that are unrelated. However, they often fail to capture different degrees of similarity or relatedness for concepts that are hierarchical in nature. To overcome this limitation, we propose HiPrBERT, a novel biomedical term representation model trained on additionally compiled data that contains hierarchical structures for various biomedical terms. We modify an existing contrastive loss function to extract information from these hierarchies. Our numerical experiments demonstrate that HiPrBERT effectively learns pairwise distances from hierarchical information, resulting in substantially more informative embeddings for further biomedical applications.
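As a minimal illustration of how a contrastive objective might take hierarchy into account, the Python sketch below weights the targets of an InfoNCE-style loss by the tree distance between terms. It is not the HiPrBERT loss, which the abstract does not specify. The toy hierarchy `PARENT`, the function names, and the `temperature` and `decay` parameters are all assumptions made for illustration.

```python
import math

# Hypothetical toy hierarchy (child -> parent); not taken from the paper's data.
PARENT = {
    "type 2 diabetes": "diabetes mellitus",
    "type 1 diabetes": "diabetes mellitus",
    "diabetes mellitus": "endocrine disorder",
    "hypothyroidism": "endocrine disorder",
    "endocrine disorder": None,
}


def ancestors(term):
    """Return the path from `term` up to the root, `term` included."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def tree_distance(a, b):
    """Count the edges between `a` and `b` via their lowest common ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    depth_in_b = {t: i for i, t in enumerate(path_b)}
    for i, t in enumerate(path_a):
        if t in depth_in_b:
            return i + depth_in_b[t]
    raise ValueError("terms do not share a root")


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def hierarchical_contrastive_loss(anchor, others, distances,
                                  temperature=0.1, decay=0.5):
    """Cross-entropy between distance-based soft targets and a softmax
    over cosine similarities (an assumed, illustrative form).

    Terms closer to the anchor in the hierarchy get a larger target
    weight, decay ** distance, so the loss rewards embeddings whose
    similarities follow the ordering implied by the hierarchy.
    """
    logits = [cosine(anchor, o) / temperature for o in others]
    # Numerically stable log-sum-exp for the softmax normaliser.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    weights = [decay ** d for d in distances]
    total = sum(weights)
    return -sum(w / total * (l - log_z) for w, l in zip(weights, logits))


# Distances from the anchor term to a sibling and to a more remote term.
dists = [
    tree_distance("type 2 diabetes", "type 1 diabetes"),
    tree_distance("type 2 diabetes", "hypothyroidism"),
]
anchor_vec = [1.0, 0.0]
# Embeddings that agree with the hierarchy: the sibling is the closer vector.
loss_aligned = hierarchical_contrastive_loss(
    anchor_vec, [[0.9, 0.1], [0.0, 1.0]], dists)
# Embeddings that contradict the hierarchy: the sibling is the farther vector.
loss_misaligned = hierarchical_contrastive_loss(
    anchor_vec, [[0.0, 1.0], [0.9, 0.1]], dists)
print(loss_aligned < loss_misaligned)
```

In this sketch, a loss with hard positive and negative labels would treat the sibling term and the more distant term identically. The soft targets are one simple way to encode graded relatedness. A real implementation would operate on batched encoder outputs rather than hand-written vectors.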