Health System Scale Semantic Search Across Unstructured Clinical Notes

Faith Wavinya Mutinda, Spandana Makeneni, Anna Lin, Shivaji Dutta, Irit R. Rasooly, Patrick Dibussolo, Shivani Kamath Belman, Hessam Shahriari, Kevin Murphy, Alex B. Ruan, Barbara H. Chaiyachati, Sanjay Chainani, Robert W. Grundmeier, Scott M. Haag, Jeffrey M. Miller, Heather M. Griffis, Ian M. Campbell

From arXiv. For associated code, see https://github.com/Ian-Campbell-Lab/clinical-semantic-search

Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks. Results: The system delivers sub-second query latency (median 237 ms for a single user, 451 ms at 20-user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with a 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24% to 89% compared with clinician-performed chart review while maintaining comparable inter-rater agreement. Conclusion: Health-system-scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.
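
The Methods describe a pipeline that splits notes into chunks, embeds each chunk, stores the vectors separately from the note text and metadata, and answers queries by vector similarity. The following is a minimal, illustrative sketch of that flow and not the authors' implementation. Several pieces are assumptions made only for this example: `toy_embed` is a normalized bag-of-words stand-in for the qwen3-embedding-0.6B model (it is not semantic), whitespace-separated words stand in for model tokens, a Python list stands in for the managed vector database, and a dict stands in for the key-value store. All function names are hypothetical.

```python
import math
from collections import Counter


def chunk_tokens(text, chunk_size=300):
    """Split text into consecutive chunks of at most chunk_size tokens.

    Assumption: whitespace-separated words stand in for model tokens.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def toy_embed(text):
    """Stand-in embedding: L2-normalized term counts as a sparse dict.

    A real deployment would call an embedding model here instead.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def dot(a, b):
    """Dot product of two sparse vectors (cosine, since both are normalized)."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def build_index(notes, chunk_size=300):
    """Return (vectors, kv_store) for a dict mapping note_id -> note text.

    vectors holds (chunk_id, embedding) pairs; kv_store maps chunk_id to
    the chunk text and its source note, mirroring the split between a
    vector index and a separate metadata store.
    """
    vectors = []
    kv_store = {}
    for note_id, text in notes.items():
        for n, chunk in enumerate(chunk_tokens(text, chunk_size)):
            chunk_id = f"{note_id}:{n}"
            vectors.append((chunk_id, toy_embed(chunk)))
            kv_store[chunk_id] = {"note_id": note_id, "text": chunk}
    return vectors, kv_store


def search(query, vectors, kv_store, k=3):
    """Brute-force top-k search: score every chunk, then look up metadata."""
    q = toy_embed(query)
    scored = [(dot(q, vec), chunk_id) for chunk_id, vec in vectors]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, kv_store[chunk_id]) for score, chunk_id in scored[:k]]


if __name__ == "__main__":
    notes = {
        "n1": "patient has fever and cough",
        "n2": "fracture of left femur after fall",
    }
    vectors, kv_store = build_index(notes)
    for score, record in search("fever cough", vectors, kv_store, k=1):
        print(round(score, 3), record["note_id"], record["text"])
```

The brute-force scan in `search` is only workable at toy scale; at the reported 484 million vectors an approximate nearest-neighbor index would be needed, which is the role the managed vector database plays in the described system. As a rough check on the chunking figures, 484 million vectors over 166 million notes works out to about 2.9 chunks per note.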
