Retrieving information from electronic health record (EHR) systems is essential for answering specific questions about patient journeys and improving the delivery of clinical care. Despite this, most EHR systems still rely on keyword-based search. With the advent of generative large language models (LLMs), information retrieval can deliver better search and summarization capabilities. Such retrievers can also feed retrieval-augmented generation (RAG) pipelines to answer arbitrary queries. However, retrieving information from the real-world clinical data held in EHR systems to serve downstream use cases is challenging because query-document support pairs are difficult to create. We provide a blueprint for creating such datasets affordably using large language models. Our method yields a retriever that outperforms proprietary counterparts such as Ada and Mistral by 30-50 F1 points on oncology data elements. We further compare our model, called Onco-Retriever, against a fine-tuned PubMedBERT model. We conduct an extensive manual evaluation on real-world EHR data, along with a latency analysis of the different models, and provide a path forward for healthcare organizations to build domain-specific retrievers.