Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge

Large language models (LLMs) are transforming the way information is retrieved, with vast amounts of knowledge being summarized and presented via natural language conversations. Yet LLMs tend to highlight the most frequently seen pieces of information from the training set and to neglect the rare ones. In biomedical research, the latest discoveries are key to academic and industrial actors, yet they are obscured by the abundance of an ever-increasing literature corpus (the information overload problem). Surfacing new associations between biomedical entities (e.g., drugs, genes, diseases) with LLMs therefore becomes a challenge of capturing the long-tail knowledge of biomedical scientific production. Retrieval Augmented Generation (RAG) has been proposed to alleviate some of the shortcomings of LLMs by augmenting the prompts with context retrieved from external datasets. RAG methods typically select the context via maximum similarity search over text embeddings. In this study, we show that RAG methods leave out a significant proportion of relevant information due to clusters of over-represented concepts in the biomedical literature. We introduce a novel information-retrieval method that leverages a knowledge graph to downsample these clusters and mitigate the information overload problem. Its retrieval performance is about twice that of embedding similarity alternatives on both precision and recall. Finally, we demonstrate that embedding similarity and knowledge graph retrieval methods can be advantageously combined into a hybrid model that outperforms both, enabling potential improvements to biomedical question-answering models.
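
To make the contrast between the two retrieval strategies concrete, the following toy sketch compares maximum similarity search with a retrieval step that downsamples over-represented clusters. It is an illustration under stated assumptions, not the method evaluated in the study: the entity label attached to each document stands in for a knowledge graph link, and the round-robin draw across entities, the two-dimensional vectors, the document identifiers, and the function names are all hypothetical.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_by_similarity(query_vec, docs, k):
    """Baseline: keep the k documents whose embeddings are closest to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


def graph_balanced_sample(docs, k, seed=0):
    """Assumed stand-in for graph-based downsampling.

    Documents are grouped by the entity they are linked to, then drawn
    round-robin across groups, so an entity with many documents cannot
    fill the whole context on its own.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for d in docs:
        groups[d["entity"]].append(d["id"])
    for ids in groups.values():
        rng.shuffle(ids)
    picked = []
    while len(picked) < k and any(groups.values()):
        for entity in sorted(groups):
            if groups[entity] and len(picked) < k:
                picked.append(groups[entity].pop())
    return picked


# Toy corpus (hypothetical): eight near-duplicate documents about one
# heavily studied entity, plus two documents about rarely studied entities.
docs = [
    {"id": f"common_{i}", "entity": "TP53", "vec": [1.0, 0.01 * i]}
    for i in range(8)
] + [
    {"id": "rare_1", "entity": "GENE_X", "vec": [0.7, 0.7]},
    {"id": "rare_2", "entity": "GENE_Y", "vec": [0.6, 0.8]},
]
query = [1.0, 0.0]

similarity_hits = top_k_by_similarity(query, docs, k=3)
balanced_hits = graph_balanced_sample(docs, k=3)
```

In this toy corpus, the similarity baseline fills all three context slots with documents from the dominant cluster, whereas the balanced draw returns one document per entity and so includes both rare ones. A hybrid retriever could, for instance, merge the two result lists, although the abstract does not specify how the combination is performed.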
