Large language models (LLMs) have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances LLM accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM), we demonstrate that the normalization accuracy of a state-of-the-art LLM increases from a baseline of 62.3% without augmentation to 90.3% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
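The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract specifies BioBERT contextual embeddings, while this sketch substitutes a character-trigram count vector as a stand-in so that it runs without model weights. The five HPO entries are a small illustrative subset, and the function names (`embed`, `retrieve_candidates`, `build_prompt`) and the prompt wording are hypothetical. In a real pipeline, `embed` would be replaced by a BioBERT encoder and the candidate list would cover the full ontology.

```python
import math
from collections import Counter

# Illustrative subset of HPO labels (assumption: a real system indexes all of HPO).
HPO_TERMS = {
    "HP:0001250": "Seizure",
    "HP:0000252": "Microcephaly",
    "HP:0001263": "Global developmental delay",
    "HP:0000365": "Hearing impairment",
    "HP:0001249": "Intellectual disability",
}


def embed(text, n=3):
    """Stand-in embedding: character n-gram counts, NOT BioBERT vectors."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(term, k=3):
    """Return the k HPO entries whose labels are most similar to the input term."""
    query = embed(term)
    scored = [
        (cosine(query, embed(label)), hpo_id, label)
        for hpo_id, label in HPO_TERMS.items()
    ]
    scored.sort(reverse=True)
    return [(hpo_id, label) for _, hpo_id, label in scored[:k]]


def build_prompt(term, candidates):
    """Assemble a hypothetical LLM prompt that lists the retrieved candidates."""
    lines = [f"Normalize the phenotype term '{term}' to one HPO concept."]
    lines.append("Candidate matches:")
    lines.extend(f"- {hpo_id}: {label}" for hpo_id, label in candidates)
    lines.append("Answer with the single best HPO ID.")
    return "\n".join(lines)


if __name__ == "__main__":
    candidates = retrieve_candidates("seizures")
    print(build_prompt("seizures", candidates))
```

The design point is that the retriever only needs term labels, not term definitions: candidates are ranked by embedding similarity and passed to the LLM, which makes the final choice. Surface-form trigrams cannot capture synonymy (for example, "small head" versus "Microcephaly"), which is one reason contextual embeddings are used in the described method.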